AutoEditor: automated base-caller error correction tool. Slides courtesy of Pawel Gajer, Ph.D.
AutoEditor. Base-calling in the context of a single chromatogram is hard… but finding base-calling “mistakes” in a multiple alignment is easy.
Outline: principal and secondary aims of AutoEditor; AutoEditor as a higher level base caller; tiling discrepancy types; base caller error types; resolving discrepancies of the form B…B*; resolving discrepancies of the form *…*B; AutoEditor statistics
A principal goal of AutoEditor is to automatically correct a majority of tiling discrepancies, reducing human editing effort to the most problematic discrepancy types. A tiling discrepancy is any deviation from the homogeneous coverage of a consensus base.
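The definition above can be illustrated with a minimal sketch that scans a tiling column by column and reports every column in which some covering read disagrees with the consensus base. The input layout (equal-length strings, `*` for a gap, `.` for positions a read does not cover) and the function name are assumptions made for this example, not AutoEditor's actual data format.

```python
from collections import Counter


def tiling_discrepancies(consensus, reads, uncovered="."):
    """Return (column, base counts) for every column with a discrepancy.

    Assumed layout: each read is a string as long as the consensus,
    '*' marks a gap and `uncovered` marks positions outside the read.
    """
    found = []
    for col, cons_base in enumerate(consensus):
        # Bases of the reads that actually cover this column.
        column = [read[col] for read in reads if read[col] != uncovered]
        # Coverage is homogeneous only if every base equals the consensus.
        if any(base != cons_base for base in column):
            found.append((col, Counter(column)))
    return found


consensus = "ACGAAT"
reads = ["ACGAAT", "ACGA*T", ".CGAAT", "ACTAA."]
for col, counts in tiling_discrepancies(consensus, reads):
    print(col, dict(counts))
```

In this toy tiling, column 2 is a substitution discrepancy (one T against three G) and column 4 is a single deletion (one `*` against three A).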
AutoEditor as a higher level base caller: single read trace data → base caller → nucleotide sequence → tiling of reads → tiling discrepancies; tiling discrepancies + multiple read trace data → AutoEditor → list of corrected discrepancies
Other applications: clear range editing (read expansion); SNP detection
Clear range editing: single read quality values data → trimming algorithm → trimmed read; less stringently trimmed reads → assembler → tiling of reads → AutoEditor
SNP detection: alignment data of genome 1 + alignment data of genome 2 → combined genomes alignment data → list of putative SNPs → AutoEditor → list of putative SNPs that pass AutoEditor error screening
Tiling discrepancy types: single deletion; single insertion
Single insertion and single deletion are extreme cases of insertion/deletion discrepancies: AAAA, AAA*, AA**, A***, ****. This sequence of discrepancies can be represented schematically as an edge in a two-vertex graph with vertices A and *.
The configuration space of all tiling discrepancy types can be schematically represented as a 4-dimensional simplex with vertices A, T, C, G, and *.
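One way to read the simplex picture: each discrepancy type corresponds to the set of distinct symbols seen in a column, which is a face of the simplex with at least two vertices. The sketch below enumerates those faces and maps a column to its face. The function names are hypothetical and this is only an illustration of the counting, not part of AutoEditor.

```python
from itertools import combinations

SYMBOLS = ("A", "T", "C", "G", "*")


def discrepancy_types(symbols=SYMBOLS):
    """All faces of the simplex with two or more vertices."""
    return [
        face
        for size in range(2, len(symbols) + 1)
        for face in combinations(symbols, size)
    ]


def column_type(column, symbols=SYMBOLS):
    """Face of the simplex spanned by the symbols present in a column."""
    return tuple(s for s in symbols if s in column)


types = discrepancy_types()
edges = [face for face in types if len(face) == 2]
print(len(types), len(edges))   # number of faces and number of edges
print(column_type("AA*A"))      # the A-to-* edge
```

With five vertices there are 10 edges and 26 faces of dimension one or higher; the insertion/deletion sequence AAA*, AA**, A*** all lies on the single edge joining A and *.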
Re-calling individual bases. Signal peak parameters: (a) amplitude; (b) support; (c) minimum difference between amplitude and local minimum. Open dots on the signal curve indicate local maxima and open circles indicate local minima.
Base caller error types: missed signal; signal shift; unresolved peaks
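A rough sketch of how the three peak parameters could be computed from a sampled trace follows. The slide does not give exact definitions, so the ones used here are assumptions: support is taken as the distance between the local minima flanking the peak (or the trace ends when there is none), and parameter (c) as the amplitude minus the higher of the two flanking minima. Function names are hypothetical.

```python
def local_extrema(signal):
    """Indices of interior local maxima and minima of a sampled signal."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] >= signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] <= signal[i + 1]:
            minima.append(i)
    return maxima, minima


def peak_parameters(signal, peak, minima):
    """Amplitude, support and min difference for one peak (assumed definitions)."""
    # Nearest local minimum on each side; fall back to the trace ends.
    left = max((m for m in minima if m < peak), default=0)
    right = min((m for m in minima if m > peak), default=len(signal) - 1)
    amplitude = signal[peak]
    support = right - left
    min_difference = amplitude - max(signal[left], signal[right])
    return amplitude, support, min_difference


trace = [0, 2, 8, 3, 1, 4, 9, 5, 0]
maxima, minima = local_extrema(trace)
for peak in maxima:
    print(peak, peak_parameters(trace, peak, minima))
```

A base could then be re-called only when all three values fall inside ranges chosen by the user; the slide does not state those ranges, so none are assumed here.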
Resolving a single deletion discrepancy
Compute the discrepancy’s read multiplicity, mult.
If mult = 0, check for a missed signal error.
If |mult| > 0, check for a signal shift error; if it is not a signal shift error, then it is an unresolved peaks error. To resolve it, find two other reads with well-resolved peaks over the bases with unresolved peaks.
A discrepancy’s read multiplicity is the number of bases to the right or left (negative sign) of the discrepancy position that are equal to the consensus base covering the discrepancy.
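The multiplicity definition and the decision steps above can be sketched as follows. Several details are assumptions of this example rather than statements from the slides: only bases directly adjacent to the discrepancy position are counted, a left-hand run takes precedence when both sides match, the trace checks are passed in as precomputed booleans, and the case mult = 0 without a missed signal is reported as "undetermined" because the slide does not cover it.

```python
def read_multiplicity(read, pos, consensus_base):
    """Signed count of bases next to `pos` that equal the consensus base.

    Negative values mean the matching run lies to the left of `pos`.
    """
    left = 0
    i = pos - 1
    while i >= 0 and read[i] == consensus_base:
        left += 1
        i -= 1
    right = 0
    i = pos + 1
    while i < len(read) and read[i] == consensus_base:
        right += 1
        i += 1
    # Assumption: if both sides match, report the left run.
    return -left if left else right


def classify_single_deletion(mult, missed_signal, signal_shift):
    """Follow the decision steps for a single deletion discrepancy.

    `missed_signal` and `signal_shift` stand in for trace checks that
    are not specified in the slides.
    """
    if mult == 0:
        return "missed signal" if missed_signal else "undetermined"
    if signal_shift:
        return "signal shift"
    return "unresolved peaks"


mult = read_multiplicity("GAA*G", 3, "A")
print(mult, classify_single_deletion(mult, False, False))
```

For an unresolved peaks result, the slide's next step is to look for two other reads with well-resolved peaks over the same bases; that search needs trace data and is left out of this sketch.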
Resolving a single insertion discrepancy
Compute the discrepancy’s read multiplicity, mult.
If mult = 0, check whether the signal parameters are within allowable ranges.
If |mult| > 0, check whether the insertion base is part of |mult| + 1 well-resolved signal peaks; if not, find two other reads whose traces have exactly |mult| well-resolved signal peaks between the bases flanking the discrepancy position.
Examples: mult = 0, weak signal error; mult = -2, unresolved peaks error, with two other reads having exactly 2 signal peaks between the Gs flanking AA*.
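The insertion steps can be sketched in the same style. This is one interpretation, not AutoEditor's code: the outcome labels, the function name, and the idea of passing peak counts in as precomputed numbers are all assumptions, and the case where neither check succeeds is reported as "undetermined".

```python
def classify_single_insertion(mult, signal_in_range, peaks_in_read,
                              peaks_in_other_reads):
    """Follow the decision steps for a single insertion discrepancy.

    signal_in_range: result of the (unspecified) signal parameter check.
    peaks_in_read: well-resolved peaks in this read between the flanking bases.
    peaks_in_other_reads: the same count for each other read in the tiling.
    """
    run = abs(mult)
    if run == 0:
        # An isolated inserted base is judged by its own signal parameters.
        return "keep base" if signal_in_range else "weak signal error"
    if peaks_in_read == run + 1:
        # The inserted base is backed by its own well-resolved peak.
        return "keep base"
    # Otherwise look for two other reads showing exactly `run` peaks.
    supporting = sum(1 for peaks in peaks_in_other_reads if peaks == run)
    if supporting >= 2:
        return "unresolved peaks error"
    return "undetermined"


print(classify_single_insertion(0, False, 1, []))
print(classify_single_insertion(-2, True, 2, [2, 2, 3]))
```

The second call mirrors the mult = -2 situation: the read does not show three well-resolved peaks, and two other reads show exactly two, so the inserted base is treated as an unresolved peaks error.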
Missed-signal (MS) and signal shift (SS) correction errors, AutoEditor version 1.1 (from Nov 12, 2002). Test set: the first 10 contigs of Mycoplasma arthritidis. Table columns: asmbl_id; size (kb); # corrections; # autoEdit errors; # errors in newer autoEdit. Total: ~3.25%, ~0.43%.
AutoEditor version 1.2 correcting all single deletion errors. Test set: the first 10 contigs of Mycoplasma arthritidis. Table columns: asmbl_id; size (in kb); #disc; #corr; %corr, where #disc is the total number of discrepancies in the given contig, #corr is the number of corrected discrepancies, and %corr is the percentage of corrected discrepancies.
AutoEditor accuracy