Imputation 2 Presenter: Ka-Kit Lam.


1 Imputation 2 Presenter: Ka-Kit Lam

2 Outline
Big Picture and Motivation
IMPUTE
IMPUTE2
Experiments
Conclusion and Discussion
Supplementary: GWAS
Supplementary: Estimate of mutation rate

3 Big Picture and Motivation

4 Background
Genome-wide association study (GWAS): identify common genetic factors that influence health and disease.

5 Background
It is important to know the SNPs. However, not all SNPs are genotyped for all individuals in the case-control study of a GWAS. How can we guess the missing parts?
Individual 1: ACCCAATTACCAGTATTTA…
Individual 2: CCCCATTTACCACTATTTA…
Individual 3: ACCCATTTACCACTATTTA…
Individual 4: CCCCATTTACCAGTATTTA…
?

6 Information known
Luckily, we now have reference genomes for human DNA.
But how can we use the reference genomes?

7 Main Question
Objective: design algorithms to impute the missing genotypes of the individuals being studied.
Criteria for algorithms: scalable and accurate.

8 Big Picture on Algorithm Design
Input: SNPs in the study, reference haplotypes/genotypes.
Algorithms: in theory, they make sense (scalability, accuracy); in practice, they work (1. experimental validation, 2. application).
Output: imputed genotypes with associated confidence.

9 IMPUTE

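10 Notations and Setting
Reference haplotypes H: an N × L matrix (N haplotypes, L SNPs) with entries 0 or 1.
Genotypes in the study sample G: a K × L matrix (K individuals, L SNPs) with entries 0, 1, 2 or ? (missing).
(Rmk: genotype 0 = alleles 00, 1 = 01, 2 = 11)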

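11 Formulation
Each study genotype vector splits into an observed part and a missing part.
Classical inference problem: infer the missing genotypes from the observed genotypes and the reference haplotypes.
A reasonable estimate:
Confidence: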
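The equations on this slide did not survive extraction. One plausible reading, stated here as an assumption rather than as the slide's content, is that the estimate is the posterior mode of the missing genotype and the confidence is the corresponding posterior probability. The symbols G_miss and G_obs are illustrative names for the missing and observed parts.

```latex
\hat{G}_{\text{miss}} = \arg\max_{g \in \{0,1,2\}} \Pr\left(G_{\text{miss}} = g \mid G_{\text{obs}}, H\right),
\qquad
\text{confidence} = \max_{g \in \{0,1,2\}} \Pr\left(G_{\text{miss}} = g \mid G_{\text{obs}}, H\right)
```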

12 Modeling (HMM model): Relationship between (H, G)
Assumptions:
Study individuals are independent.
Each study haplotype is copied as a mosaic of the reference haplotypes, and the copying process is captured by a Hidden Markov Model.
Mutations at different sites are conditionally independent given the copied haplotype.

13 Modeling (HMM model): Relationship between (H, G)
Illustration: the reference haplotypes (N rows, L sites) and one study individual's genotypes (?, 2, 2, 1).

14 Modeling (HMM model): Relationship between (H, G)
Illustration: HMM diagram with observed genotypes 1, 2, 1, …

15 Modeling (Transition Probability)
States:
Transition:
What is the intuition?

16 Modeling: Relationship between Transition Probability and Recombination
Recombination process:

17 Modeling: Relationship between Transition Probability and Recombination
Recombination process:
The more references there are, the longer the copy length.
The copy length in our model depends on the genetic distance between SNPs.
Illustration: reference panel 1, reference panel 2 and a study individual; the study individual is more likely to have a longer copy length where the panel is larger.

18 Modeling (Transition Probability)
States:
Transition:
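The transition equations themselves are missing from the transcript. A minimal sketch follows, assuming a Li and Stephens style copying model (an assumption, not taken from the slide): across one SNP interval a haplotype keeps copying the same reference unless a recombination occurs, in which case it picks a reference uniformly at random. The parameter `rho` (a scaled recombination distance) and both function names are hypothetical.

```python
import math

def haploid_transition(same: bool, rho: float, n_ref: int) -> float:
    """Probability of moving between copied references across one SNP interval.

    Assumed Li-and-Stephens-style form; rho and n_ref are illustrative parameters.
    """
    stay = math.exp(-rho / n_ref)      # no recombination in the interval
    jump = (1.0 - stay) / n_ref        # recombine, then pick a reference uniformly
    return stay + jump if same else jump

def diploid_transition(cur, nxt, rho, n_ref):
    """Transition between haplotype-pair states, assuming the two copies move independently."""
    return (haploid_transition(cur[0] == nxt[0], rho, n_ref)
            * haploid_transition(cur[1] == nxt[1], rho, n_ref))

# Each row of the pair-state transition matrix should sum to 1.
n = 4
row_sum = sum(diploid_transition((0, 1), (a, b), rho=0.5, n_ref=n)
              for a in range(n) for b in range(n))
```

Under this assumed form, a larger panel or a shorter genetic distance raises the probability of staying on the same reference, which matches the intuition that copy lengths grow with the number of references.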

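19 Modeling (Emission Probability)
Define the mutation rate λ. Since mutation is assumed independent across sites, the probability of each genotype given the alleles on the copied haplotype pair is:
Copied alleles   Genotype 0 (00)   Genotype 1 (01)   Genotype 2 (11)
00               (1-λ)^2           2λ(1-λ)           λ^2
01               λ(1-λ)            λ^2 + (1-λ)^2     λ(1-λ)
11               λ^2               2λ(1-λ)           (1-λ)^2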
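The emission probabilities on this slide can be derived mechanically from a per-haplotype mutation rate. The sketch below assumes each copied allele flips independently with probability `lam`; the function name `emission_row` is hypothetical.

```python
def emission_row(copied_sum: int, lam: float) -> list[float]:
    """P(genotype = 0, 1, 2 | the alleles on the two copied haplotypes sum to copied_sum)."""
    def p_one(allele: int) -> float:
        # Probability that one haplotype emits allele 1, given the copied allele.
        return 1.0 - lam if allele == 1 else lam

    a, b = {0: (0, 0), 1: (0, 1), 2: (1, 1)}[copied_sum]
    pa, pb = p_one(a), p_one(b)
    return [
        (1 - pa) * (1 - pb),             # genotype 0: both haplotypes emit allele 0
        pa * (1 - pb) + (1 - pa) * pb,   # genotype 1: exactly one emits allele 1
        pa * pb,                         # genotype 2: both haplotypes emit allele 1
    ]

table = [emission_row(s, lam=0.01) for s in (0, 1, 2)]
```

Each row of `table` is a probability distribution over the three genotypes, and the first and last rows mirror each other.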

20 Extension (completely missing)
Problem: a genotype is missing across all references and all study samples. How can we impute it?
What can we expect? Can we generate information from no information?
We cannot expect to know the genotype, but we can guess the relationship between the genotypes.
Our friend, population genetics, may help!

21 Imputation on Reference
Illustration: reference haplotypes H(1), H(2), H(3), H(4), …, H(N) with a completely missing (?) site.

22 Imputation on Reference
Algorithm:
1. Randomly select an ordering
2. Sample the first mutation according to
3. Treat the previous haplotypes as references and impute
4. Repeat several times to get a stable output
5. Use the imputed reference to impute the study sample

23 Computational Complexity: Imputation
O(N^2 L) for each individual

24 Computational Complexity: Imputation
O(N^2 L) for each individual

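25 Computational Complexity: Forward-Backward Algorithm
Forward equations: a naïve application takes O(N^4) per SNP, because each of the N^2 haplotype-pair states sums over all N^2 previous states.

26 Computational Complexity: Forward-Backward Algorithm
Q: How can the forward update be computed in O(N^2)?
A: Reuse partial sums over the previous states (as suggested in fastPhase).

27 Computational Complexity: Forward-Backward Algorithm
Finally, the forward update splits into three terms: one costs O(N^2) in total, one costs O(N) for each j (O(N^2) in total), and one costs O(N) for each i (O(N^2) in total). The backward part is handled similarly.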
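The forward equations are not preserved in the transcript, so the following is only a sketch of the general idea, shown for a single haplotype rather than a pair. It assumes a transition of the form `stay` on the diagonal plus a uniform `jump` to every state, which lets the sum over previous states be computed once per site. The same factorisation applied to each coordinate of a haplotype pair is what would bring O(N^4) down to O(N^2). All names and numbers are illustrative.

```python
import math

def forward_naive(emis, stay, jump):
    """Forward pass with an explicit sum over all previous states: O(N^2) per site."""
    n = len(emis[0])
    alpha = [e / n for e in emis[0]]            # uniform initial copying state
    for site in range(1, len(emis)):
        alpha = [
            emis[site][j] * sum(alpha[i] * (jump + (stay if i == j else 0.0))
                                for i in range(n))
            for j in range(n)
        ]
    return alpha

def forward_fast(emis, stay, jump):
    """Same recursion, with the shared sum over previous states computed once: O(N) per site."""
    n = len(emis[0])
    alpha = [e / n for e in emis[0]]
    for site in range(1, len(emis)):
        total = sum(alpha)                      # reused by every target state j
        alpha = [emis[site][j] * (stay * alpha[j] + jump * total) for j in range(n)]
    return alpha

# Toy example: 3 reference haplotypes, 4 sites, arbitrary emission values.
emis = [[0.9, 0.1, 0.9], [0.1, 0.9, 0.9], [0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]
stay = math.exp(-0.2)        # probability of no switch (assumed value)
jump = (1.0 - stay) / 3      # probability of switching to one given reference
naive = forward_naive(emis, stay, jump)
fast = forward_fast(emis, stay, jump)
```

Both functions return the same forward variables; only the amount of work per site differs.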

28 Demo
./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int

29 Demo

30 IMPUTE2

31 Motivation
Accuracy: not all information is used during imputation (e.g. other study individuals).
Complexity: the method needs to scale well if we incorporate all information (e.g. previously it was O(L N^2)).
New data type: diploid reference panels (1000 Genomes Project).
Q: How can we design algorithms to handle this?

32 Description of Setting (Scenario A)
Reference haplotypes: Nhap rows, L SNPs.
Genotypes in the inference panel: Ninf rows, L SNPs, with entries such as ?, 2, 1.
T, U: sets of SNP indices, marked by colour in the figure.
(Rmk: 0-00, 1-01, 2-11)

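33 Description of Setting (Scenario B)
Reference haplotypes: Nhap rows, L SNPs.
Diploid reference panel: Ndip rows of genotypes, with some entries missing (?).
Inference panel: Ninf rows of genotypes, with some entries missing (?).
T, U1, U2: sets of SNP indices, marked by colour in the figure.
(Rmk: 0-00, 1-01, 2-11)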

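34 Algorithm for Scenario A
Illustration: reference haplotypes and inference-panel genotypes (1, 2, 1, …) with missing entries (?).

35 Algorithm for Scenario A
Illustration (burn-in): inference-panel genotypes are replaced by initial haplotype guesses (00, 11, 10); untyped entries remain ?.

36 Algorithm for Scenario A
Illustration (phasing): update individual i by sampling a new haplotype pair for its genotypes.

37 Algorithm for Scenario A
Illustration (imputing): update individual i by filling in its missing entries.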

38 Phasing Step: Path Sampling
How do we sample a path?

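39 Imputation Step: Extract Posterior Probability
After many rounds, for each individual and for each missing site, we get allele probabilities for each haplotype. Assuming independence in sampling the haplotype pair, these combine into genotype probabilities:
            Site 1                 Site 2
Hap 1       P(0)=0.3  P(1)=0.7     P(0)=0.2  P(1)=0.8     …
Hap 2       P(0)=0.1  P(1)=0.9     P(0)=0.4  P(1)=0.6     …
Genotype    P(0)=0.03 P(1)=0.34 P(2)=0.63     P(0)=0.08 P(1)=0.44 P(2)=0.48     …
Then take the average over rounds.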
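The numbers on this slide can be reproduced by multiplying the two haplotypes' allele probabilities under the stated independence assumption. The sketch below does that and then averages over rounds; both function names are hypothetical.

```python
def genotype_posterior(p1_alt: float, p2_alt: float) -> tuple[float, float, float]:
    """Combine per-haplotype P(allele = 1) into P(genotype = 0, 1, 2), assuming independence."""
    p_g0 = (1 - p1_alt) * (1 - p2_alt)
    p_g1 = p1_alt * (1 - p2_alt) + (1 - p1_alt) * p2_alt
    p_g2 = p1_alt * p2_alt
    return p_g0, p_g1, p_g2

def average_rounds(rounds):
    """Average genotype posteriors from several sampling rounds, component by component."""
    m = len(rounds)
    return tuple(sum(r[g] for r in rounds) / m for g in range(3))

site1 = genotype_posterior(0.7, 0.9)   # Hap 1: P(1) = 0.7, Hap 2: P(1) = 0.9
site2 = genotype_posterior(0.8, 0.6)   # Hap 1: P(1) = 0.8, Hap 2: P(1) = 0.6
averaged = average_rounds([site1, site2])  # averaging two sites here only to exercise the helper
```

In an actual run the averaged tuples would come from different rounds at the same site, not from two different sites as in this toy call.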

40 Algorithm for Scenario A: Complexity Analysis
A) Burn-in phase
B) MCMC iterations, m times. For each individual i:
i) phase(i, T, hap+inf): O((Nhap + Ninf)^2 L_T)
ii) impute(i, T+U, hap): O(Nhap L_{T+U}); missing genotypes in T also need to be imputed
iii) record(posterior probability): O(L_{T+U})
C) Average over different runs of MCMC to get the genotype and confidence

41 Benefits of the Algorithm
Faster: reduces the load in the imputation step.
More accurate: uses the available information to make the guess.

42 Algorithm for Scenario B
Illustration: haploid reference panel (Nhap), diploid reference panel (Ndip) and inference panel (Ninf), with SNP sets T, U1 and U2 and missing entries (?).

43 Algorithm for Scenario B
Illustration (burn-in): genotypes in the diploid reference and inference panels are replaced by initial haplotype guesses; missing entries remain ?.

44 Algorithm for Scenario B
Illustration (phase T and U2 in diploid reference): update individual i.

45 Algorithm for Scenario B
Illustration (impute U1 in diploid reference): update individual i.

46 Algorithm for Scenario B
Illustration (phase T in inference panel): update individual i.

47 Algorithm for Scenario B
Illustration (impute U2 in inference panel): update individual i.

48 Algorithm for Scenario B
Illustration (impute U1 in inference panel): update individual i. Figure annotation: need not match blue.

49 Algorithm for Scenario B: Complexity Analysis
A) Burn-in phase
B) MCMC iterations, m times.
For each individual i in the diploid reference:
i) phase(i, T+U2, hap+dip): O((Nhap + Ndip)^2 L_{T+U2})
ii) impute(i, T+U1, hap): O(Nhap L_{T+U1})
iii) record(posterior probability): O(L_{T+U1})
For each individual i in the inference panel:
i) phase(i, T, hap+dip+inf): O((Nhap + Ndip + Ninf)^2 L_T)
ii) impute(i, T+U2, hap+dip): O(N_{hap+dip} L_{T+U2}); missing genotypes in T also need to be imputed
iii) impute(i, U1, hap): O(Nhap L_{U1})
iv) record(posterior probability): O(L_{T+U1+U2})
C) Average over different runs of MCMC to get the genotype and confidence

50 Benefits of the Algorithm
Able to handle the new data type.
Faster and more accurate.

51 Further Speeding Up
Choose the k closest neighbours in phasing: this requires computing Hamming distances. The cost is O(k^2 L) for the HMM plus O(NL) for the Hamming distance computation (better than O(N^2 L) in the previous HMM calculation).
Choose the khap closest neighbours in imputation: khap >> k is acceptable (because the cost is O(k^2) in phasing but O(k) in imputation).
Mark down the numbers: 50 vs 250.
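A minimal sketch of the neighbour selection described on this slide, assuming haplotypes are stored as equal-length lists of 0/1 alleles. The function names are hypothetical, and the tie-breaking rule (lower index first) is a choice made for this example only.

```python
def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def k_closest(target, panel, k):
    """Indices of the k panel haplotypes closest to target by Hamming distance.

    Computing all distances costs O(N L); sorted() is stable, so ties keep panel order.
    """
    order = sorted(range(len(panel)), key=lambda i: hamming(target, panel[i]))
    return order[:k]

target = [0, 1, 1, 0]
panel = [
    [0, 1, 1, 0],   # distance 0
    [1, 0, 0, 1],   # distance 4
    [0, 1, 0, 0],   # distance 1
    [1, 1, 1, 0],   # distance 1
]
neighbours = k_closest(target, panel, k=2)
```

The HMM would then be run on the selected subset only, which is where the O(k^2 L) phasing cost on the slide comes from.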

52 Comparison with BEAGLE
Weakness of BEAGLE:
Full joint modeling of all individuals
Accuracy decreases when the population or the number of SNPs increases in the experiments
Less accurate for rare SNPs than IMPUTE2
More memory efficient
Strength of BEAGLE:
Faster
Better accommodates trios and duos

53 Demo
./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2

54 Experiments

55 Experiment plans
Evaluation of the performance of imputation: accuracy; time and space complexity; comparison with other methods.
Application of imputation: identification of associated SNPs in GWAS.
Optimizing performance: effect of multiple reference panels.

56 Accuracy and Calibration
Setting:
Mask the known genotypes.
Impute using IMPUTE.
Compare the called genotype with the ground truth.
Calling threshold: by genotype or by SNP.
Measure % missing and % mismatch for different thresholds.
Compare the estimated confidence with the experimental confidence.
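The masking experiment can be sketched as follows, under two assumptions that are not stated on the slide: a genotype is called only when its top posterior probability reaches the threshold, and % mismatch is measured among called genotypes only. The function name and the toy data are hypothetical.

```python
def call_and_score(posteriors, truth, threshold):
    """Return (% missing, % mismatch) for posterior triples scored against true genotypes."""
    called = mismatched = 0
    for probs, true_g in zip(posteriors, truth):
        best = max(range(3), key=lambda g: probs[g])
        if probs[best] < threshold:
            continue                      # below threshold: left uncalled (missing)
        called += 1
        mismatched += best != true_g
    n = len(truth)
    pct_missing = 100.0 * (n - called) / n
    pct_mismatch = 100.0 * mismatched / called if called else 0.0
    return pct_missing, pct_mismatch

# Toy masked genotypes: posterior over (0, 1, 2) and the true value.
posteriors = [(0.03, 0.34, 0.63), (0.08, 0.44, 0.48), (0.9, 0.1, 0.0), (0.2, 0.7, 0.1)]
truth = [2, 1, 0, 0]
loose = call_and_score(posteriors, truth, threshold=0.6)
strict = call_and_score(posteriors, truth, threshold=0.8)
```

Raising the threshold trades more missing calls for fewer mismatches, which is the curve the experiment traces out.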

57 Accuracy and Calibration
Figure axes: % mismatch versus % missing.
Message: IMPUTE is reasonably accurate and well calibrated.

58 Comparison: Accuracy (in general and for rare alleles)
The closer to the lower left, the better.
Message: IMPUTE2 is accurate, especially for rare alleles.

59 Comparison: Algorithm Complexity (Time and Space Complexity)
Phasing step: shorter L.
Imputation step: linear in N.
Multiple MCMC iterations increase the running time.
Message: IMPUTE2 has reasonable time and space complexity.

60 Application 1: Identification of associated SNPs
Setting:
Use case and control sets to identify the gene associated with Type II Diabetes.
Use filtered genotypes that have MAF > 1%.
Evaluate the P-value and plot it against chromosome position to identify the causal gene.
Useful for identifying SNPs to follow up and for assessing the strength of a signal.

61 Application 1: Identification of associated SNPs
Red: imputed SNPs. Black: typed SNPs.
Message: IMPUTE helps identify SNPs associated with the phenotype.

62 Application 2: Validation of missing data
Setting:
Some collected genotypes are not very reliable.
Treat such a genotype as missing and impute it.
Call the imputed genotype and compare it to the original genotype.

63 Application 2: Validation of missing data
Figure labels: genotype calls BB, AB, AA and uncalled (?).
Message: IMPUTE helps confirm the reliability of the data.

64 Effect of Reference Set

65 Effect of Reference Set
Motivation: capture low-frequency variants by incorporating data across populations while remaining computationally efficient.
Setting: Pearson correlation as the accuracy measure; varying khap; adding more references.
Approach: khap.

66 Effect of Reference Set
The improvement saturates when we have enough references.
The improvement also saturates when khap reaches a certain threshold, so they tune khap to a reasonable value.
Message: a larger reference set improves accuracy, and IMPUTE2 facilitates this.

67 Summary
IMPUTE, IMPUTE2 and their extensions attempt to design algorithms for imputation based on a population genetics model and HMM computation.
Extensive experiments suggest that IMPUTE2 is reasonably accurate and can make good use of the reference data sets available for GWAS.

68 Discussion
Parameters in HMM: Can they learn the parameters of the copying process from the study data through the EM algorithm?
Completely missing SNPs: Can they use a clustering algorithm to impute completely missing data?
Trios: Can they use different panels to do the imputation?
Speed: Can they preprocess the reference to speed up the computation? Can BEAGLE's idea of merging be used in some part of the pre-HMM computation?

69 Supplementary : GWAS

70 Genetic Architecture
Why are we interested in imputation? For GWAS.
Domain of interest:

71 Case-Control Study and Bayes Factor
Genotype    0    1    2
Cases       s0   s1   s2
Controls    r0   r1   r2
The prior distribution of theta is known.

72 Supplementary : Reverse Engineering the per site mutation probability

73 Review of Population Genetics
Wright-Fisher model for coalescence: there are 2M individuals; the next generation is generated by choosing parents at random, with replacement, from the last generation and copying them.
Infinite-sites model for mutation: at every inheritance there is a probability u of mutation, and each mutation occurs at a distinct site that has never mutated before.
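A small simulation can make the Wright-Fisher sampling concrete. The sketch traces two lineages backwards in time: in each generation both pick a parent uniformly at random with replacement, and they coalesce when they pick the same parent. The population size plays the role of 2M on the slide; the value 50, the seed and the function name are arbitrary choices for illustration.

```python
import random

def generations_to_coalescence(pop_size: int, rng: random.Random) -> int:
    """Count generations until two lineages choose the same parent."""
    t = 0
    while True:
        t += 1
        if rng.randrange(pop_size) == rng.randrange(pop_size):
            return t

rng = random.Random(0)
pop_size = 50                                # stands in for 2M (toy value)
times = [generations_to_coalescence(pop_size, rng) for _ in range(2000)]
mean_time = sum(times) / len(times)          # geometric waiting time, mean close to pop_size
```

Because two lineages coalesce with probability 1/pop_size per generation, the waiting time is geometric with mean pop_size, which the simulated average should approximate.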

74 Relationship between Coalescent Theory and Imputation
Our question: given a sample of N individuals as references, what is the per-site mutation rate λ between the study sample and its nearest neighbour among the N references?
Illustration: the whole population (2M), the N references, the study sample and its nearest neighbour in the references.

75 Estimation of Mutation Rate λ
Pr(no coalescence between the study sample and all N references in the last t generations)
Average time to coalescence
Thus, the mutation rate is λ

76 Estimation of Mutation Rate λ
Illustration: coalescent tree of the N references over time t, with intervals t2, t3, t4. Estimate u, then estimate λ.
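The formulas on these two slides were lost in extraction. For comparison only, the sketch below uses the form found in the Li and Stephens (2003) copying model: theta = 1 / (1 + 1/2 + ... + 1/(N-1)) and a per-site mismatch rate lambda = theta / (2 (theta + N)). Whether the slides arrive at exactly this expression is an assumption, and both function names are hypothetical.

```python
def watterson_theta(n_ref: int) -> float:
    """theta = 1 / sum_{i=1}^{N-1} 1/i, for N >= 2 reference haplotypes."""
    if n_ref < 2:
        raise ValueError("need at least two reference haplotypes")
    return 1.0 / sum(1.0 / i for i in range(1, n_ref))

def mismatch_rate(n_ref: int) -> float:
    """Per-site rate lambda = theta / (2 * (theta + N)) under the assumed form."""
    theta = watterson_theta(n_ref)
    return theta / (2.0 * (theta + n_ref))

lam_small_panel = mismatch_rate(60)
lam_large_panel = mismatch_rate(120)
```

Under this form the rate shrinks as the panel grows, consistent with the nearest reference being a closer relative of the study sample when more references are available.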

77 References
Marchini et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet (2007) vol. 39 (7).
Howie et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet (2009) vol. 5 (6).
Howie et al. Genotype imputation with thousands of genomes. G3 (Bethesda) (2011) vol. 1 (6).
Marchini and Howie. Genotype imputation for genome-wide association studies. Nat Rev Genet (2010) vol. 11 (7).
R. Durrett. Probability Models for DNA Sequence Evolution. Springer, 2nd ed., 2008.
N. Li and M. Stephens. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics, 165:2213–2233, 2003.

78 Thank you

