Published by Vincent Brown. Modified over 8 years ago.
1
kGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data
Alexander Artyomenko
2
Introduction
Reconstructing the spectrum of a viral population
Challenges:
– Assembling short reads to span the entire genome
– Distinguishing sequencing errors from mutations
Avoid assembling:
– Identify sequences via a high-variability region
3
Previous Work
KEC (k-mer Error Correction) [Skums et al.]
– Incorporates counts (frequencies) of k-mers (substrings of length k)
QuasiRecomb (Quasispecies Recombination) [Töpfer et al.]
– Hidden Markov Model-based approach
– Incorporates possibility for recombinant progeny
– Parameter: k generators (ancestor haplotypes)
4
Problem Formulation
Given: a set of reads R emitted by a set of unknown haplotypes H'
Find: a set of haplotypes H = {H_1, …, H_k} maximizing Pr(R|H)
5
Fractional Haplotype
Fractional haplotype: a string of 5-tuples of probabilities, one tuple per position, giving the probability of each possible symbol: a, c, t, g, d = '-'
(top row: most probable symbol at each position)

      a     c     -     t     c     t     g     c
a   0.71  0.06  0.0   0.13  0.0   0.27  0.10  0.03
c   0.13  0.94  0.0   0.0   0.64  0.0   0.14  0.58
t   0.16  0.0   0.01  0.87  0.11  0.73  0.0   0.09
g   0.0   0.0   0.21  0.0   0.25  0.0   0.76  0.09
d   0.0   0.0   0.78  0.0   0.0   0.0   0.0   0.21
6
kGEM
Initialize (fractional) haplotypes
Repeat until haplotypes are unchanged:
– Estimate Pr(r|H_i), the probability of a read r being emitted by haplotype H_i
– Estimate frequencies of haplotypes
– Update and round haplotypes
– Collapse identical and drop rare haplotypes
Output haplotypes
7
Initialization
Find a set of reads representing the haplotype population
– Start with a random read
– Each next read maximizes the minimum distance to the previously chosen reads
[Figure: selected reads labeled 1 to 4]
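The selection rule on this slide is a farthest-first traversal. A minimal sketch follows, assuming the reads are aligned strings of equal length and that "distance" means Hamming distance; the slide does not name the metric, and the function names `hamming` and `select_seed_reads` are made up for illustration.

```python
import random

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length aligned reads differ."""
    return sum(x != y for x, y in zip(a, b))

def select_seed_reads(reads, k, rng=None):
    """Pick up to k reads: a random first one, then repeatedly the read whose
    minimum distance to the already chosen reads is largest.
    Hamming distance is an assumption, not something the slide states."""
    rng = rng or random.Random(0)
    chosen = [rng.choice(reads)]
    # min_dist[i] = distance from reads[i] to its nearest chosen read
    min_dist = [hamming(r, chosen[0]) for r in reads]
    while len(chosen) < min(k, len(reads)):
        best = max(range(len(reads)), key=min_dist.__getitem__)
        if min_dist[best] == 0:  # every remaining read duplicates a seed
            break
        chosen.append(reads[best])
        min_dist = [min(d, hamming(r, reads[best]))
                    for d, r in zip(min_dist, reads)]
    return chosen

seeds = select_seed_reads(["aaaa", "aaat", "tttt", "aaaa"], k=2)
print(seeds)
```

Keeping a running `min_dist` list avoids recomputing all pairwise distances each round, so selecting k seeds from n reads costs about k·n distance evaluations.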
8
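Initialization
Transform each selected read s into a fractional haplotype using the formula

$$ f_m(e) = \begin{cases} 1 - 4\varepsilon & \text{if } s_m = e \\ \varepsilon & \text{otherwise} \end{cases} $$

where s_m is the m-th nucleotide of the selected read s and e ranges over a, c, t, g, d.

Example for the read ac-tg-ga-c with ε = 0.01:

      a     c     -     t     g     -     g     a     -     c
a   0.96  0.01  0.01  0.01  0.01  0.01  0.01  0.96  0.01  0.01
c   0.01  0.96  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.96
t   0.01  0.01  0.01  0.96  0.01  0.01  0.01  0.01  0.01  0.01
g   0.01  0.01  0.01  0.01  0.96  0.01  0.96  0.01  0.01  0.01
d   0.01  0.01  0.96  0.01  0.01  0.96  0.01  0.01  0.96  0.01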
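The slide's example (ε = 0.01, entries 0.96 and 0.01 over five symbols) suggests that the read's own symbol gets probability 1 − 4ε and every other symbol gets ε. Under that assumption, the transformation can be sketched as below. The list-of-dicts representation and the name `read_to_fractional` are illustrative choices, not taken from the slides.

```python
ALPHABET = "actg-"  # symbol order from the slides; '-' is the deletion symbol d

def read_to_fractional(read, eps=0.01):
    """Return one {symbol: probability} dict per position of the read.
    Assumed rule: 1 - 4*eps for the read's own symbol, eps for the other four."""
    return [{e: (1 - 4 * eps if e == s else eps) for e in ALPHABET}
            for s in read]

haplotype = read_to_fractional("ac-tg-ga-c")
print(haplotype[0])  # column for the first position, where the read has 'a'
```

Because ε is strictly positive, no symbol ever has probability zero, so a single mismatching read position cannot force an emission probability to zero.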
9
Read Emission Probability
For each i = 1, …, k and for each read r_j from R, compute the value h_{i,j} = Pr(r_j | H_i)
[Figure: reads and haplotypes connected by edges labeled h_{1,1}, h_{2,1}, h_{3,1}, h_{1,2}, h_{2,2}, h_{3,2}]
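The formula for this value did not survive in the transcript. One natural reading, given that a fractional haplotype stores a per-position probability for each symbol, is that positions are treated independently and the per-position probabilities are multiplied. The sketch below assumes exactly that; the function name `emission_prob` is hypothetical.

```python
import math

def emission_prob(read, haplotype):
    """Assumed model: Pr(r | H) is the product, over positions m, of the
    probability that haplotype column m assigns to the read's symbol r_m.
    `haplotype` is a list of {symbol: probability} dicts, one per position."""
    return math.prod(column[s] for s, column in zip(read, haplotype))

# Two-position haplotype that strongly favours the string "ac"
hap = [
    {"a": 0.96, "c": 0.01, "t": 0.01, "g": 0.01, "-": 0.01},
    {"a": 0.01, "c": 0.96, "t": 0.01, "g": 0.01, "-": 0.01},
]
print(emission_prob("ac", hap), emission_prob("aa", hap))
```

For amplicon-length reads a plain product can underflow to zero in double precision, so a practical version would probably sum logarithms instead.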
10
Estimate Frequencies
Estimate haplotype frequencies via the Expectation Maximization (EM) method
Repeat two steps until the change < σ:
– E-step: expected portion of r emitted by H_i
– M-step: updated frequency of haplotype H_i
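The E-step and M-step formulas are missing from the transcript. The sketch below uses the standard mixture-model EM updates as a stand-in: the E-step splits each read among haplotypes in proportion to frequency times emission probability, and the M-step averages those portions. It assumes every read has weight 1, and the names `estimate_frequencies`, `tol` (playing the role of σ) and `max_iter` are illustrative.

```python
def estimate_frequencies(h, tol=1e-6, max_iter=1000):
    """h[i][j] = Pr(r_j | H_i). Returns (f, p), where f[i] is the estimated
    frequency of haplotype i and p[i][j] is the expected portion of read j
    emitted by haplotype i. Standard mixture EM, assumed rather than sourced."""
    k, n = len(h), len(h[0])
    f = [1.0 / k] * k  # start from uniform frequencies
    p = [[1.0 / k] * n for _ in range(k)]
    for _ in range(max_iter):
        # E-step: share of each read attributed to each haplotype
        for j in range(n):
            total = sum(f[i] * h[i][j] for i in range(k))
            for i in range(k):
                p[i][j] = f[i] * h[i][j] / total if total > 0 else 1.0 / k
        # M-step: new frequency = average share over all reads
        new_f = [sum(p[i]) / n for i in range(k)]
        change = max(abs(a - b) for a, b in zip(f, new_f))
        f = new_f
        if change < tol:
            break
    return f, p

# Three reads fit haplotype 0 well, one read fits haplotype 1
h = [[0.9, 0.9, 0.9, 0.001],
     [0.001, 0.001, 0.001, 0.9]]
freqs, portions = estimate_frequencies(h)
print(freqs)
```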
11
Update Haplotypes
Update allele frequencies for each haplotype according to each read's contribution:

a   0.71  0.06  0.0   0.13  0.0   0.27  …  0.10  0.03
c   0.13  0.94  0.0   0.0   0.64  0.0   …  0.14  0.58
t   0.16  0.0   0.01  0.87  0.11  0.73  …  0.0   0.09
g   0.0   0.0   0.21  0.0   0.25  0.0   …  0.76  0.09
d   0.0   0.0   0.78  0.0   0.0   0.0   …  0.0   0.21
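The update formula itself is not in the transcript. A plausible reading of "according to each read's contribution" is a weighted vote: at each position, a symbol's new probability is the total contribution of reads carrying that symbol, divided by the total contribution of all reads. The sketch below implements that assumption, with the contribution taken to be the E-step portion of each read; `update_haplotype` and `resp_i` are hypothetical names.

```python
ALPHABET = "actg-"  # '-' is the deletion symbol d

def update_haplotype(reads, resp_i):
    """Recompute one fractional haplotype from aligned, equal-length reads.
    resp_i[j] is the portion of read j attributed to this haplotype.
    Weighted-vote rule is an assumption; the slide's formula is missing."""
    total = sum(resp_i)
    if total <= 0:
        raise ValueError("haplotype has no read support")
    columns = []
    for m in range(len(reads[0])):
        weights = {e: 0.0 for e in ALPHABET}
        for read, p in zip(reads, resp_i):
            weights[read[m]] += p  # read votes for its symbol at position m
        columns.append({e: w / total for e, w in weights.items()})
    return columns

cols = update_haplotype(["ac", "ac", "at", "gt"], [1.0, 1.0, 1.0, 0.0])
print(cols[1])
```

In this toy call the fourth read has zero weight, so its `g` at the first position does not affect the result.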
12
Round Haplotypes
Round each haplotype's position to the most probable allele

a   0.76  0.0   0.01  0.06  0.77  0.0   0.29  …  0.14  0.09
c   0.11  0.89  0.01  0.01  0.23  0.68  0.0   …  0.06  0.50
t   0.13  0.0   0.11  0.93  0.0   0.14  0.71  …  0.0   0.04
g   0.01  0.0   0.21  0.0   0.0   0.18  0.0   …  0.80  0.23
d   0.01  0.11  0.68  0.0   0.0   0.0   0.0   …  0.0   0.14

Rounded haplotype: ac-tactgc

Fractional form of the rounded haplotype:

a   0.96  0.01  0.01  0.01  0.96  0.01  0.01  …  0.01  0.01
c   0.01  0.96  0.01  0.01  0.01  0.96  0.01  …  0.01  0.96
t   0.01  0.01  0.01  0.96  0.01  0.01  0.96  …  0.01  0.01
g   0.01  0.01  0.01  0.01  0.01  0.01  0.01  …  0.96  0.01
d   0.01  0.01  0.96  0.01  0.01  0.01  0.01  …  0.01  0.01
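Rounding amounts to taking the most probable symbol in every column. A short sketch follows, using the first three columns of the slide's example as input. The list-of-dicts representation and the name `round_haplotype` are illustrative, and ties are broken by dictionary order, which the slide does not specify.

```python
def round_haplotype(fractional):
    """Return the integral string made of each column's most probable symbol."""
    return "".join(max(column, key=column.get) for column in fractional)

# First three columns of the slide's example
columns = [
    {"a": 0.76, "c": 0.11, "t": 0.13, "g": 0.01, "-": 0.01},
    {"a": 0.0,  "c": 0.89, "t": 0.0,  "g": 0.0,  "-": 0.11},
    {"a": 0.01, "c": 0.01, "t": 0.11, "g": 0.21, "-": 0.68},
]
print(round_haplotype(columns))  # prints "ac-"
```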
13
Collapse and Drop Rare
Collapse haplotypes which have the same integral strings
Drop haplotypes with coverage ≤ δ
– Empirically, δ < 5 lowers PPV without improving sensitivity
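The two operations on this slide can be sketched as a merge followed by a filter. The sketch assumes that "coverage" is the (expected) number of reads assigned to a haplotype and that collapsed haplotypes pool their coverage; neither is spelled out on the slide. The default δ = 5 is chosen only because the slide reports that smaller values hurt PPV, and `collapse_and_drop` is a hypothetical name.

```python
from collections import defaultdict

def collapse_and_drop(haplotypes, coverage, delta=5):
    """Merge haplotypes with identical integral strings (summing their
    coverage, an assumption), then keep only those with coverage > delta."""
    merged = defaultdict(float)
    for hap, cov in zip(haplotypes, coverage):
        merged[hap] += cov
    return {hap: cov for hap, cov in merged.items() if cov > delta}

kept = collapse_and_drop(["acgt", "acgt", "ac-t", "aggt"], [4, 3, 5, 40])
print(kept)
```

In this toy input the two copies of `acgt` survive only because their pooled coverage exceeds δ, while `ac-t` sits exactly at the threshold and is dropped.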
14
kGEM
Initialize (fractional) haplotypes
Repeat until haplotypes are unchanged:
– Estimate Pr(r|H_i), the probability of a read r being emitted by haplotype H_i
– Estimate frequencies of haplotypes
– Update and round haplotypes
– Collapse identical and drop rare haplotypes
Output haplotypes
15
Experimental Setup
HCV E1E2 sub-region (315 bp)
20 simulated data sets of 10 variants
100,000 reads from Grinder 0.5
10 data sets with homopolymer errors
Frequency distribution: uniform and power-law model with parameter α = 2.0
17
Acknowledgements
Nicholas Mancuso, Alex Zelikovsky, Pavel Skums, Ion Măndoiu
18
Thank you! Questions?