Published by Alan Lamb. Modified over 9 years ago.
1
AutoEditor: automated base caller error correction tool
Slides courtesy of Pawel Gajer, Ph.D.
2
AutoEditor
Base-calling in the context of a single chromatogram is hard… but finding base-calling “mistakes” in a multiple alignment is easy.
3
Principal and secondary aims of AutoEditor
AutoEditor as a higher-level base caller
Tiling discrepancy types
Base caller error types
Resolving discrepancies of the form B…B*
Resolving discrepancies of the form *…*B
AutoEditor statistics
4
A principal goal of AutoEditor is to automatically correct a majority of tiling discrepancies, reducing human editing effort to the most problematic discrepancy types. A tiling discrepancy is any deviation from the homogeneous coverage of a consensus base.
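The definition above can be illustrated with a small sketch. It assumes a tiling is given as equal-length gapped read strings, with `*` for a gap and a space where a read does not cover a column, and it takes the consensus by simple majority vote. Both the data layout and the voting rule are assumptions made for illustration, not AutoEditor's actual data model.

```python
from collections import Counter

GAP = "*"      # gap symbol, as used on the slides
NO_COV = " "   # hypothetical marker: the read does not cover this column

def column(reads, i):
    """Symbols of all reads that cover column i."""
    return [r[i] for r in reads if r[i] != NO_COV]

def find_discrepancies(reads):
    """Return (column index, consensus symbol, column symbols) for every
    column whose coverage is not homogeneous."""
    found = []
    for i in range(len(reads[0])):
        col = column(reads, i)
        if not col:
            continue
        # Majority vote stands in for the real consensus (assumption).
        consensus = Counter(col).most_common(1)[0][0]
        if any(b != consensus for b in col):
            found.append((i, consensus, "".join(col)))
    return found

reads = [
    "GAAAG",
    "GAA*G",   # one read is missing an A: a single deletion discrepancy
    "GAAAG",
]
print(find_discrepancies(reads))
```

Only column 3 is reported here, because it is the only column where a read deviates from the consensus base.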
5
AutoEditor as a higher-level base caller
single read trace data -> base caller -> nucleotide sequence -> tiling of reads -> tiling discrepancies
tiling discrepancies + multiple read trace data -> AutoEditor -> list of corrected discrepancies
6
Other applications:
Clear range editing (read expansion)
SNP detection
7
Clear range editing
single read quality values data -> trimming algorithm -> trimmed read
less stringently trimmed reads -> assembler -> tiling of reads -> AutoEditor
8
SNP detection
alignment data of genome 1 + alignment data of genome 2 -> combined genomes alignment data -> list of putative SNPs -> AutoEditor -> list of putative SNPs that pass AutoEditor error screening
9
Tiling discrepancy types
Single deletion:
Single insertion:
10
Single insertion and single deletion are extreme cases of insertion/deletion discrepancies:
AAAA  AAA*  AA**  A***  ****
The above sequence of discrepancies can be represented schematically as an edge in a two-vertex graph with vertices A and *.
11
The configuration space of all tiling discrepancy types can be schematically represented as a 4-dimensional simplex with vertices A, T, C, G and *.
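One way to read this picture is to map each tiling column to the point of the simplex whose barycentric coordinates are the fractions of A, T, C, G and * in that column. This reading is an illustrative assumption, not the slides' own construction. Under it, homogeneous columns sit on a vertex, and the A/* columns of the previous slide move along a single edge. The example columns below are made up for the sketch.

```python
from fractions import Fraction

SYMBOLS = "ATCG*"  # the five vertices of the simplex

def barycentric(col):
    """Fraction of each symbol in a column; the values are non-negative
    and sum to 1, so they are barycentric coordinates."""
    n = len(col)
    return tuple(Fraction(col.count(s), n) for s in SYMBOLS)

def support_vertices(col):
    """Symbols present in the column: one symbol means a vertex
    (no discrepancy), two symbols mean a point on an edge, and so on."""
    return [s for s in SYMBOLS if s in col]

for col in ["AAAA", "AAA*", "AA**", "A***", "****"]:
    print(col, [str(x) for x in barycentric(col)], support_vertices(col))
```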
12
Re-calling individual bases
Figure: signal peak parameters: amplitude (a), support (b), minimum difference between amplitude and local minimum (c). Open dots on the signal curve indicate local maxima and open circles indicate local minima.
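A rough sketch of how such peak parameters could be measured on a single trace channel is given below. The slide does not define the parameters precisely, so the definitions used here are assumptions: amplitude (a) is the signal value at a local maximum, support (b) is the distance between the two flanking local minima, and (c) is the amplitude minus the higher of those two minima. The function names and the sample trace are hypothetical.

```python
def local_extrema(signal):
    """Indices of strict local maxima and minima. The two endpoints are
    treated as minima so that every peak has two flanks (a simplification)."""
    maxima, minima = [], [0]
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] > signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] < signal[i + 1]:
            minima.append(i)
    minima.append(len(signal) - 1)
    return maxima, minima

def peak_parameters(signal):
    """Amplitude (a), support (b) and minimum amplitude-to-minimum
    difference (c) for each peak, under the assumed definitions."""
    maxima, minima = local_extrema(signal)
    peaks = []
    for m in maxima:
        left = max(i for i in minima if i < m)    # nearest minimum on the left
        right = min(i for i in minima if i > m)   # nearest minimum on the right
        a = signal[m]
        b = right - left
        c = a - max(signal[left], signal[right])
        peaks.append({"pos": m, "amplitude": a, "support": b, "min_diff": c})
    return peaks

trace = [0, 2, 9, 3, 1, 4, 8, 7, 2, 0]   # made-up intensities
for p in peak_parameters(trace):
    print(p)
```

A re-caller could then compare these values against allowable ranges to decide whether a peak is well resolved. The thresholds themselves are not given on the slides.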
13
Base caller error types
Missed signal
Signal shift
Unresolved peaks
14
Resolving a single deletion discrepancy
compute the discrepancy's read multiplicity: mult
if mult = 0 then check for a missed signal error
if |mult| > 0 then check for a signal shift error
if it is not a signal shift error then it is an unresolved peaks error
To resolve an unresolved peaks error, find two other reads with well-resolved peaks over the bases of the unresolved peaks.
A discrepancy's read multiplicity is the number of bases to the right, or to the left (negative sign), of the discrepancy position that are equal to the consensus base covering the discrepancy.
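The read multiplicity and the order of checks can be sketched as follows. The sketch assumes a read is a gapped string with `*` at the discrepancy position. When the consensus base occurs on both sides, the right-hand run is taken, which is a tie-breaking assumption the slide does not specify. The trace-level checks themselves are not implemented; only their order is returned.

```python
GAP = "*"

def read_multiplicity(read, pos, consensus_base):
    """Signed length of the run of consensus_base adjacent to position pos:
    positive for a run to the right, negative for a run to the left."""
    right = 0
    i = pos + 1
    while i < len(read) and read[i] == consensus_base:
        right += 1
        i += 1
    if right:
        return right          # assumption: the right-hand run takes precedence
    left = 0
    i = pos - 1
    while i >= 0 and read[i] == consensus_base:
        left += 1
        i -= 1
    return -left

def checks_for_single_deletion(mult):
    """Error types to test, in the order given on the slide."""
    if mult == 0:
        return ["missed signal"]
    return ["signal shift", "unresolved peaks"]

mult = read_multiplicity("GAA*G", 3, "A")   # consensus column is A
print(mult, checks_for_single_deletion(mult))
```

For the read `GAA*G` the two A's to the left of the gap give mult = -2, so a signal shift is tested first and an unresolved peaks error is the fallback.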
16
Resolving a single insertion discrepancy
compute the discrepancy's read multiplicity: mult
if mult = 0 then check if the signal parameters are within allowable ranges
if |mult| > 0 then check if the insertion base is part of |mult|+1 well-resolved signal peaks
if not, find two other reads whose traces have exactly |mult| well-resolved signal peaks between the bases flanking the discrepancy position
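The decision flow for a single insertion might be sketched as below. The inputs are assumed to be precomputed from the traces, and the outcome labels ("supported", "rejected", "unresolved") are an interpretation added for illustration. The slide lists only the checks, not what follows from each of them, and the function and parameter names are hypothetical.

```python
def resolve_single_insertion(mult, signal_ok, peaks_in_read, peaks_in_other_reads):
    """
    mult                 : read multiplicity of the discrepancy
    signal_ok            : True if the insertion base's signal parameters are
                           within allowable ranges (used only when mult == 0)
    peaks_in_read        : well-resolved peaks covering the insertion base and
                           its run in the inserting read
    peaks_in_other_reads : well-resolved peak counts between the flanking
                           bases, one entry per other read
    """
    k = abs(mult)
    if k == 0:
        return "insertion supported" if signal_ok else "insertion rejected"
    if peaks_in_read == k + 1:
        # The inserted base is one of |mult|+1 well-resolved peaks.
        return "insertion supported"
    # Otherwise look for two other reads with exactly |mult| peaks.
    agreeing = sum(1 for p in peaks_in_other_reads if p == k)
    if agreeing >= 2:
        return "insertion rejected"
    return "unresolved"

print(resolve_single_insertion(-2, True, 2, [2, 2, 3]))
```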
17
mult = 0: weak signal error
mult = -2: unresolved peaks error, with two other reads having exactly 2 signal peaks between the Gs flanking AA*
18
Missed-signal (MS) and signal shift (SS) correction errors
AutoEditor version 1.1 (from Nov 12, 2002)
Test set: the first 10 contigs of Mycoplasma arthritidis

asmbl_id  size (kb)  # corrections  # autoEdit errors  # errors in newer autoEdit
       1        132            124                  3                           0
       2         64             78                  4                           1
       3         40             55                  3                           0
       4         53             45                  2                           1
       5         16             15                  0                           0
       6         22             29                  1                           0
       7         23             19                  0                           0
       8         51             48                  1                           0
       9         26             33                  1                           0
      10         15             15                  0                           0
---------------------------------------------------------------------------------
  Total:        442            461                 15                           2
                                               ~3.25%                      ~0.43%
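The two percentages under the totals can be checked directly from the per-contig counts, reading each row as (asmbl_id, size in kb, corrections, autoEdit errors, errors in newer autoEdit). The short script below only reproduces that arithmetic, and the variable names are illustrative.

```python
# (asmbl_id, size_kb, corrections, autoedit_errors, errors_in_newer_autoedit)
rows = [
    (1, 132, 124, 3, 0),
    (2, 64, 78, 4, 1),
    (3, 40, 55, 3, 0),
    (4, 53, 45, 2, 1),
    (5, 16, 15, 0, 0),
    (6, 22, 29, 1, 0),
    (7, 23, 19, 0, 0),
    (8, 51, 48, 1, 0),
    (9, 26, 33, 1, 0),
    (10, 15, 15, 0, 0),
]

total_size = sum(r[1] for r in rows)
total_corrections = sum(r[2] for r in rows)
total_errors = sum(r[3] for r in rows)
total_errors_newer = sum(r[4] for r in rows)

# Error rate = erroneous corrections / all corrections, as a percentage.
rate = 100 * total_errors / total_corrections
rate_newer = 100 * total_errors_newer / total_corrections

print(total_size, total_corrections, total_errors, total_errors_newer)
print(f"{rate:.2f}% {rate_newer:.2f}%")
```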
19
AutoEditor version 1.2 correcting all single deletion errors
Test set: the first 10 contigs of Mycoplasma arthritidis

asmbl_id  size (kb)  #disc  #corr  %corr
       1        132   3390   3266    96%
       2         64   2195   2142    98%
       3         40   1344   1325    99%
       4         53   1304   1242    95%
       5         16    508    487    96%
       6         22    777    757    97%
       7         23    624    613    98%
       8         51   1303   1232    95%
       9         26    783    760    97%
      10         15    437    423    97%
----------------------------------------
  Total:        442  12665  12065    95%

where
#disc is the total number of discrepancies in the given contig
#corr is the number of corrected discrepancies
%corr is the percentage of corrected discrepancies
20
AutoEditor accuracy