AutoEditor: Automated base caller error correction tool. Slides courtesy of Pawel Gajer, Ph.D.

1 AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D.

2 AutoEditor: Base-calling in the context of a single chromatogram is hard… but finding base-calling “mistakes” in a multiple alignment is easy.

3 Principal and secondary aims of AutoEditor; AutoEditor as a higher level base caller; tiling discrepancy types; base caller error types; resolving discrepancies of the form B…B*; resolving discrepancies of the form *…*B; AutoEditor statistics

4 A principal goal of AutoEditor is to automatically correct a majority of tiling discrepancies, reducing human editing effort to the most problematic discrepancy types. A tiling discrepancy is any deviation from the homogeneous coverage of a consensus base.
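The definition of a tiling discrepancy above can be illustrated with a minimal Python sketch. It assumes a hypothetical input format that is not from the slides: reads are equal-length aligned strings, `*` marks a gap, and a space means the read does not cover that column. The consensus is taken by simple majority vote, which is also an assumption made for illustration, not AutoEditor's method.

```python
from collections import Counter

def find_tiling_discrepancies(rows):
    """Return (column, consensus, counts) for every non-homogeneous column.

    rows: equal-length aligned read strings (hypothetical format);
          '*' is a gap, ' ' means the read does not cover the column.
    """
    discrepancies = []
    for col in range(len(rows[0])):
        bases = [r[col] for r in rows if r[col] != " "]
        if not bases:
            continue
        counts = Counter(bases)
        # Assumption: consensus base = most frequent symbol in the column.
        consensus = counts.most_common(1)[0][0]
        if len(counts) > 1:  # coverage of the consensus base is not homogeneous
            discrepancies.append((col, consensus, dict(counts)))
    return discrepancies

reads = [
    "ACGTA",
    "ACG*A",
    "ACGTA",
    " CGTA",
]
print(find_tiling_discrepancies(reads))
```

In this toy tiling only column 3 is reported, because one read carries a gap where the other three carry T.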

5 autoEditor as a higher level base caller: single read trace data -> base caller -> nucleotide sequence -> tiling of reads -> tiling discrepancies; tiling discrepancies + multiple read trace data -> autoEditor -> list of corrected discrepancies

6 Other applications: Clear range editing (read expansion) SNP detection

7 Clear range editing: single read quality values data -> trimming algorithm -> trimmed read; less stringently trimmed reads -> assembler -> tiling of reads -> autoEditor

8 SNP detection: alignment data of genome 1 + alignment data of genome 2 -> combined genomes alignment data -> list of putative SNPs -> autoEditor -> list of putative SNPs that pass autoEditor error screening

9 Tiling discrepancy types Single deletion: Single insertion:

10 Single insertion and single deletion are extreme cases of insertion/deletion discrepancies: AAAA, AAA*, AA**, A***, ****. The above sequence of discrepancies can be represented schematically as an edge in a two-vertex graph with vertices A and *.

11 The configuration space of all tiling discrepancy types can be schematically represented as a 4-dimensional simplex with vertices A, T, C, G, and *.

12 Re-calling individual bases. Signal peak parameters: (a) amplitude, (b) support, (c) minimum difference between amplitude and local minimum. Open dots on the signal curve indicate local maxima and open circles indicate local minima.
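As a rough illustration of the three peak parameters named on this slide, the Python sketch below locates local maxima and minima in a sampled signal and derives the parameters for each peak. The precise definitions are assumptions, since the slide gives only labels: support is taken as the distance between the local minima flanking a peak, and the minimum difference as the peak height above the higher of those two minima. The function names and the sample signal are hypothetical.

```python
def local_extrema(signal):
    """Return index lists (maxima, minima) of interior local extrema."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] >= signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] <= signal[i + 1]:
            minima.append(i)
    return maxima, minima

def peak_parameters(signal):
    """Compute (a) amplitude, (b) support, (c) min difference for each peak."""
    maxima, minima = local_extrema(signal)
    # Treat the signal ends as boundaries so every peak has two flanks.
    bounds = [0] + minima + [len(signal) - 1]
    peaks = []
    for m in maxima:
        left = max(b for b in bounds if b < m)
        right = min(b for b in bounds if b > m)
        amplitude = signal[m]
        peaks.append({
            "pos": m,
            "amplitude": amplitude,                      # (a)
            "support": right - left,                     # (b), assumed definition
            "min_diff": amplitude - max(signal[left],    # (c), assumed definition
                                        signal[right]),
        })
    return peaks

trace = [0, 2, 8, 3, 1, 4, 9, 5, 0]  # made-up sample values
for peak in peak_parameters(trace):
    print(peak)
```

A re-caller could then compare these values against allowable ranges to decide whether a peak is strong enough to call a base; the ranges themselves are not given in the slides.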

13 Base caller error types Missed signal Signal shift Unresolved peaks

14 Resolving a single deletion discrepancy
Compute the discrepancy's read multiplicity, mult.
If mult = 0, check for a missed signal error.
If |mult| > 0, check for a signal shift error.
If it is not a signal shift error, it is an unresolved peaks error. To resolve it, find two other reads with well-resolved peaks over the bases of the unresolved peaks.
A discrepancy's read multiplicity is the number of bases to the right or to the left (negative sign) of the discrepancy position that are equal to the consensus base covering the discrepancy.
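The read multiplicity and the decision steps for a single deletion can be sketched in Python as follows. This is an interpretation under stated assumptions, not AutoEditor's code: the read is a plain string with `*` at the discrepancy position, the run to the right is preferred when runs exist on both sides (the slide does not say how to break that tie), and `is_signal_shift` stands in for a trace inspection that is not modeled here.

```python
def read_multiplicity(read, pos, consensus_base):
    """Signed count of bases adjacent to `pos` equal to the consensus base.

    Positive: run to the right of the discrepancy.
    Negative: run to the left. Zero: no adjacent match.
    Assumption: the right-hand run wins if both sides match.
    """
    right = 0
    i = pos + 1
    while i < len(read) and read[i] == consensus_base:
        right += 1
        i += 1
    if right:
        return right
    left = 0
    i = pos - 1
    while i >= 0 and read[i] == consensus_base:
        left += 1
        i -= 1
    return -left

def classify_single_deletion(mult, is_signal_shift):
    """Follow the slide's decision steps; the labels are illustrative."""
    if mult == 0:
        return "check for missed signal"
    if is_signal_shift:  # hypothetical result of inspecting the trace
        return "signal shift"
    return "unresolved peaks"

mult = read_multiplicity("GAA*G", 3, "A")
print(mult, classify_single_deletion(mult, is_signal_shift=False))
```

For the read `GAA*G` with consensus base A at the gap, two As sit to the left of the discrepancy, so the multiplicity is -2.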

15

16 Resolving a single insertion discrepancy
Compute the discrepancy's read multiplicity, mult.
If mult = 0, check whether the signal parameters are within allowable ranges.
If |mult| > 0, check whether the insertion base is part of |mult|+1 well-resolved signal peaks.
If not, find two other reads whose traces have exactly |mult| well-resolved signal peaks between the bases flanking the discrepancy position.
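The insertion procedure above can be written as a small decision function. The Python sketch below takes the peak counts as precomputed inputs, because extracting them from trace data is outside what the slides describe. All parameter names and outcome labels are hypothetical, chosen only to mirror the wording of the steps.

```python
def resolve_single_insertion(mult, signal_in_range,
                             resolved_peaks_in_read,
                             resolved_peaks_in_others):
    """Apply the slide's decision steps for a single insertion.

    mult: read multiplicity of the discrepancy.
    signal_in_range: True if the inserted base's signal parameters are
        within allowable ranges (used only when mult == 0).
    resolved_peaks_in_read: well-resolved peaks found in this read's trace
        over the run containing the inserted base.
    resolved_peaks_in_others: per-read counts of well-resolved peaks
        between the flanking bases in the other reads.
    """
    n = abs(mult)
    if n == 0:
        return "insertion supported" if signal_in_range else "weak signal error"
    if resolved_peaks_in_read == n + 1:
        return "insertion supported"
    # Need two other reads showing exactly |mult| well-resolved peaks.
    supporting = sum(1 for p in resolved_peaks_in_others if p == n)
    if supporting >= 2:
        return "unresolved peaks error"
    return "left for human editing"

print(resolve_single_insertion(0, False, 0, []))
print(resolve_single_insertion(-2, True, 2, [2, 2, 3]))
```

The final fallback label reflects the stated goal of leaving the most problematic discrepancies to human editors; the slides do not name that outcome explicitly.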

17 Examples: mult = 0, weak signal error; mult = -2, unresolved peaks error, with two other reads having exactly 2 signal peaks between the Gs flanking AA*.

18 Missed-signal (MS) and signal shift (SS) correction errors, AutoEditor version 1.1 (from Nov 12, 2002)
Test set: the first 10 contigs of Mycoplasma arthritidis
asmbl_id  size (kb)  # corrections  # autoEdit errors  # errors in newer autoEdit
1         132        124            3                  0
2         64         78             4                  1
3         40         55             3                  0
4         53         45             2                  1
5         16         15             0                  0
6         22         29             1                  0
7         23         19             0                  0
8         51         48             1                  0
9         26         33             1                  0
10        15         15             0                  0
----------------------------------------------------------------------
Total:    442        461            15                 2
Error rate:                         ~3.25%             ~0.43%
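The two percentages on this slide are the error counts divided by the total number of corrections. A short Python check, using the per-contig counts as read from the slide's table:

```python
# Per-contig counts for the first 10 contigs, as read from the slide.
corrections  = [124, 78, 55, 45, 15, 29, 19, 48, 33, 15]
errors_v1_1  = [3, 4, 3, 2, 0, 1, 0, 1, 1, 0]
errors_newer = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

total = sum(corrections)
rate_v1_1 = 100 * sum(errors_v1_1) / total
rate_newer = 100 * sum(errors_newer) / total

print(total, sum(errors_v1_1), sum(errors_newer))
print(f"{rate_v1_1:.2f}% {rate_newer:.2f}%")
```

The sums reproduce the slide's totals (461 corrections, 15 and 2 errors) and its rates of about 3.25% and 0.43%.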

19 AutoEditor version 1.2 correcting all single deletion errors
Test set: the first 10 contigs of Mycoplasma arthritidis
asmbl_id  size (kb)  #disc   #corr   %corr
1         132        3390    3266    96%
2         64         2195    2142    98%
3         40         1344    1325    99%
4         53         1304    1242    95%
5         16         508     487     96%
6         22         777     757     97%
7         23         624     613     98%
8         51         1303    1232    95%
9         26         783     760     97%
10        15         437     423     97%
--------------------------------------------------------------------
Total:    442        12665   12065   95%
where #disc is the total number of discrepancies in the given contig, #corr is the number of corrected discrepancies, and %corr is the percentage of corrected discrepancies.

20 AutoEditor accuracy

21 AutoEditor accuracy

