AMOS Assembly Validation and Visualization Michael Schatz October 7, 2008
Lander-Waterman Statistics Sequencing as a random process E(#contigs) = Ne-cσ E(contig size) = L[((ecσ – 1)/c) - (1- σ)] L = read length T = minimum overlap G = genome size N = number of reads c = coverage (NL / G) σ = 1 – T/L
Assembly Reality Contigs are never as large as expected High coverage is a necessary but not sufficient condition Sequencing is basically random, but sequence composition is not Repeats control the quality of the assembly Assemblers break contigs at ambiguous repeats Highly repetitive genomes will be highly fragmented Assemblers make mistakes Mis-assemblies confuse all downstream analysis Tension between overlap error rate and repeat resolution
Assembly Evaluation
Assembly Evaluation
Genome Assembly Forensics Computationally scan an assembly for mis-assemblies. Data inconsistencies are indicators for mis-assembly Some inconsistencies are merely statistical variations amosvalidate Load Assembly Data into Bank Analyze Mate Pairs & Libraries Analyze Depth of Coverage Analyze Normalized K-mers Analyze Read Alignments Analyze Read Breakpoints Load Mis-assembly Signatures into Bank AMOS Bank
Mis-assembly types Compression A R B Expansion A R B Correct Rearrangement A R B C D Correct Mis-assembly Inversion A R C B Basic mis-assemblies can be combined into more complicated patterns: Insertions, Deletions, Giant Hairballs
1. Analyze Mate Pairs Evaluate mate “happiness” across assembly Happy = Correct orientation and distance Finds regions with multiple: Compressed Mates Expanded Mates Invalid orientation (same or outie ) Missing Mates Linking mates (mate in a different scaffold) Singleton mates (mate is not in any contig)
Mate-Happiness: Deletion Deletion: Excise reads between collapsed repeats Truth Misassembly: Compressed Mates, Linking Mates
Mate-Happiness: Insertion Insertion: Additional reads between flanking repeats Truth Misassembly: Expanded Mates, Missing Mates
Mate-Happiness: Inversion Inversion: Flipping of reads Truth Misassembly: Misoriented Mates
Mate-Happiness: Rearrangement Rearrangement: Reordering of reads Truth Misassembly: Misoriented, Compressed, Stretched Mates
C/E Statistic Easy to detect mis-oriented or missing mates, but individual compressed or expanded mates are expected. Does the distribution of inserts spanning a given position differ from the rest of the library? Record large differences as potential misassemblies Even if each individual mate is “happy” Compute the statistic at all positions (Local Mean – Global Mean) / (Global Stdev / √N) > +3 indicates significant expansion < -3 indicates significant compression Introduced by Jim Yorke’s group at UMD
Sampling the Genome 2kb 4kb 6kb 8 inserts: 3kb-6kb Local Mean: 4048 C/E Stat: (4048-4000) = +0.33 (400 / √8) Near 0 indicates overall happiness 0kb
C/E-Statistic: Expansion 0kb 2kb 4kb 6kb 8 inserts: 3.2kb-6kb Local Mean: 4461 C/E Stat: (4461-4000) = +3.26 (400 / √8) C/E Stat ≥ 3.0 indicates Expansion
C/E-Statistic: Compression 0kb 2kb 4kb 6kb 8 inserts: 3.2 kb-4.8kb Local Mean: 3488 C/E Stat: (3488-4000) = -3.62 (400 / √8) C/E Stat ≤ -3.0 indicates Compression
2. Read Coverage Find regions of contigs where the depth of coverage is unusually high Collapsed Repeat Signature Can detect collapse of 100% identical repeats AMOS Tool: analyzeReadDepth 2.5x mean coverage A R1 B R2 A R1 + R2 B
3. Normalized K-mers Not all repeats are mis-assembled, but most mis-assemblies occur at repeats. Detect repeats by noticing increased read k-mer coverage. 2 copy repeat will have ~2x the average read k-mer coverage N copy repeat will have ~Nx the average read k-mer coverage Normalize read k-mer coverage with consensus k-mer coverage to highlight just the repeats with “missing” copies. The sequence of an N copy repeat should occur N times in the consensus.
Normalized K-mers Correct assembly of 2 copy repeat A R1 B R2 Read K-mers Consensus K-mers Normalized K-mers
Normalized K-mers Mis-assembly of 2 copy repeat Effective even if reads are left as singletons A B R1 + R2 Read K-mers Consensus K-mers Normalized K-mers
4. Read Alignment Multiple reads with same conflicting base are unlikely 1x QV 30: 1/1000 base calling error 2x QV 30: 1/1,000,000 base calling error 3x QV 30: 1/1,000,000,000 base calling error Regions of correlated SNPs are likely to be assembly errors or interesting biological events Highly specific metric AMOS Tools: analyzeSNPs & clusterSNPs Locate regions with high rate of correlated SNPs Parameterized thresholds: Multiple positions within 100bp sliding window 2+ conflicting reads Cumulative QV >= 40 (1/10000 base calling error) A G C A G C A G C A G C A G C A G C C T A C T A C T A C T A C T A
5. Read Breakpoints Align singleton reads to consensus sequences. A consistent breakpoint shared by multiple reads can indicate a collapsed repeat. Initially developed to detect collapsed repeat in Bacillus Anthracis. 375 BAPDN53TF 786bp 665 BAPDF83TF 786bp 428 BAPCM37TR 697bp 668 BAPBW17TR 1049bp 144337 16S rRNA 146944 144203 145515 146226 147021 RA RB
Performance Combining signatures into suspicious regions greatly improves specificity.
Hawkeye
Hawkeye Goals Interactively explore and analyze Libraries Insert Sizes, Read Length, Inserts Scaffolds & Contigs Sizes, Composition, Sequence Multiple Alignment, SNP Barcode Read Coverage, k-mer Coverage Inserts Happiness, Coverage, CE Statistic Reads Clear Range, Quality Values, Chromatograms Features Arbitrary regions of interest Including Mis-assembly Signatures!!! Overview Zoom & Filter Details on Demand
Launch Pad
Histograms & Statistics Insert Size Read Length GC Content Overall Statistics Bird’s eye view of data and assembly quality
Scaffold View Statistical Plots Scaffold Features Inserts Overview Control Panel Details
Insert Happiness Happy Stretched Compressed Both mates present Oriented Correctly && |Insert Size – Library.mean| <= Happy-Distance * Library.sd Stretched Insert Size > Library.mean + Happy-Distance * Library.sd Compressed Insert Size < Library.mean - Happy-Distance * Library.sd Misoriented Same or Outies Linking Read’s mate is in some other scaffold Singleton Read’s mate is a singleton Unmated No mate was provided for read Both mates present Only 1 read present
Standard Feature Types [B] Breakpoint Alignment ends at this position [C] Coverage Location of unusual mate coverage (asmQC) [S] SNPs Location of Correlated SNPs [U] Unitig Used to report location of surrogate unitigs in CA assemblies [X] Other All other Features Loading Features: $ loadFeatures bankname featfile Featfile format: Contigid type end5 end3 comment
Contig View Consensus & Position Scrollable Read Tiling Read Orientation Discrepancy Highlight Summary Navigation Contig Quick Select Regular Expression Consensus Search
Contig View Expanded Quality Values Normalized Chromatogram Chromatograms are loaded from specified directories, or on demand from Trace Archive.
Misassembly Walkthough: Assembly Reports Misassembly Walkthough: Correlated SNPs Contigs Features Reads Scaffolds Full Integration: “Double click takes you there”
SNP View Zoom Out SNP Sorted Reads Polymorphism View
SNP Barcode SNP Sorted Reads Colored Rectangle indicate the positions and composition of the SNPs
Scaffold View Coverage CE Statistic SNP Feature Happy Stretched Compressed Misoriented Linking
Collapsed Repeat Read Coverage Spike -5.5 CE Dip Compressed Mates Cluster 68 Correlated SNPs
Confirmed Misassembly Fixed Collapsed repeat Compressed mates (-5.5 CE Stat) Correlated SNPs (68 Positions within 1400bp) Spike in Read Coverage
Fixing the collapsed repeat Select reads and mates in region of collapse. AMOS: findMissingMates, select-reads Reassemble those reads with a stricter unitigger error rate. AMOS: minimus Patch the collapsed region of the original assembly with corrected version. AMOS: stitchContigs
stitchContigs Before After The multiple alignment of the patch replaces the multiple alignment between the two stitch reads. Otherwise, reads left of the left stitch read and right of the right stitch read are taken directly from the original master contig in their original multiple alignment, so the master contig and the stitched contig are identical except near the compression point.
More Information AMOS Webpages: Contact AMOS Acknowledgements http://amos.sourceforge.net/forensics http://amos.sourceforge.net/hawkeye Contact AMOS amos-help [ at ] lists.sourceforge.net Acknowledgements Adam Phillippy Ben Shneiderman Steven Salzberg Art Delcher Mihai Pop