Download presentation
Presentation is loading. Please wait.
1
AMOS Assembly Validation and Visualization
Michael Schatz October 7, 2008
2
Lander-Waterman Statistics
Sequencing as a random process E(#contigs) = Ne-cσ E(contig size) = L[((ecσ – 1)/c) - (1- σ)] L = read length T = minimum overlap G = genome size N = number of reads c = coverage (NL / G) σ = 1 – T/L
3
Assembly Reality Contigs are never as large as expected
High coverage is a necessary but not sufficient condition Sequencing is basically random, but sequence composition is not Repeats control the quality of the assembly Assemblers break contigs at ambiguous repeats Highly repetitive genomes will be highly fragmented Assemblers make mistakes Mis-assemblies confuse all downstream analysis Tension between overlap error rate and repeat resolution
4
Assembly Evaluation
5
Assembly Evaluation
6
Genome Assembly Forensics
Computationally scan an assembly for mis-assemblies. Data inconsistencies are indicators for mis-assembly Some inconsistencies are merely statistical variations amosvalidate Load Assembly Data into Bank Analyze Mate Pairs & Libraries Analyze Depth of Coverage Analyze Normalized K-mers Analyze Read Alignments Analyze Read Breakpoints Load Mis-assembly Signatures into Bank AMOS Bank
7
Mis-assembly types Compression A R B Expansion A R B Correct
Rearrangement A R B C D Correct Mis-assembly Inversion A R C B Basic mis-assemblies can be combined into more complicated patterns: Insertions, Deletions, Giant Hairballs
8
1. Analyze Mate Pairs Evaluate mate “happiness” across assembly
Happy = Correct orientation and distance Finds regions with multiple: Compressed Mates Expanded Mates Invalid orientation (same or outie ) Missing Mates Linking mates (mate in a different scaffold) Singleton mates (mate is not in any contig)
9
Mate-Happiness: Deletion
Deletion: Excise reads between collapsed repeats Truth Misassembly: Compressed Mates, Linking Mates
10
Mate-Happiness: Insertion
Insertion: Additional reads between flanking repeats Truth Misassembly: Expanded Mates, Missing Mates
11
Mate-Happiness: Inversion
Inversion: Flipping of reads Truth Misassembly: Misoriented Mates
12
Mate-Happiness: Rearrangement
Rearrangement: Reordering of reads Truth Misassembly: Misoriented, Compressed, Stretched Mates
13
C/E Statistic Easy to detect mis-oriented or missing mates, but individual compressed or expanded mates are expected. Does the distribution of inserts spanning a given position differ from the rest of the library? Record large differences as potential misassemblies Even if each individual mate is “happy” Compute the statistic at all positions (Local Mean – Global Mean) / (Global Stdev / √N) > +3 indicates significant expansion < -3 indicates significant compression Introduced by Jim Yorke’s group at UMD
14
Sampling the Genome 2kb 4kb 6kb 8 inserts: 3kb-6kb Local Mean: 4048
C/E Stat: ( ) = +0.33 (400 / √8) Near 0 indicates overall happiness 0kb
15
C/E-Statistic: Expansion
0kb 2kb 4kb 6kb 8 inserts: 3.2kb-6kb Local Mean: 4461 C/E Stat: ( ) = +3.26 (400 / √8) C/E Stat ≥ 3.0 indicates Expansion
16
C/E-Statistic: Compression
0kb 2kb 4kb 6kb 8 inserts: 3.2 kb-4.8kb Local Mean: 3488 C/E Stat: ( ) = -3.62 (400 / √8) C/E Stat ≤ -3.0 indicates Compression
17
2. Read Coverage Find regions of contigs where the depth of coverage is unusually high Collapsed Repeat Signature Can detect collapse of 100% identical repeats AMOS Tool: analyzeReadDepth 2.5x mean coverage A R1 B R2 A R1 + R2 B
18
3. Normalized K-mers Not all repeats are mis-assembled, but most mis-assemblies occur at repeats. Detect repeats by noticing increased read k-mer coverage. 2 copy repeat will have ~2x the average read k-mer coverage N copy repeat will have ~Nx the average read k-mer coverage Normalize read k-mer coverage with consensus k-mer coverage to highlight just the repeats with “missing” copies. The sequence of an N copy repeat should occur N times in the consensus.
19
Normalized K-mers Correct assembly of 2 copy repeat A R1 B R2
Read K-mers Consensus K-mers Normalized K-mers
20
Normalized K-mers Mis-assembly of 2 copy repeat
Effective even if reads are left as singletons A B R1 + R2 Read K-mers Consensus K-mers Normalized K-mers
21
4. Read Alignment Multiple reads with same conflicting base are unlikely 1x QV 30: 1/1000 base calling error 2x QV 30: 1/1,000,000 base calling error 3x QV 30: 1/1,000,000,000 base calling error Regions of correlated SNPs are likely to be assembly errors or interesting biological events Highly specific metric AMOS Tools: analyzeSNPs & clusterSNPs Locate regions with high rate of correlated SNPs Parameterized thresholds: Multiple positions within 100bp sliding window 2+ conflicting reads Cumulative QV >= 40 (1/10000 base calling error) A G C A G C A G C A G C A G C A G C C T A C T A C T A C T A C T A
22
5. Read Breakpoints Align singleton reads to consensus sequences. A consistent breakpoint shared by multiple reads can indicate a collapsed repeat. Initially developed to detect collapsed repeat in Bacillus Anthracis. 375 BAPDN53TF 786bp 665 BAPDF83TF bp 428 BAPCM37TR 697bp 668 BAPBW17TR 1049bp 144337 16S rRNA 146944 144203 145515 146226 147021 RA RB
23
Performance Combining signatures into suspicious regions greatly improves specificity.
24
Hawkeye
25
Hawkeye Goals Interactively explore and analyze Libraries
Insert Sizes, Read Length, Inserts Scaffolds & Contigs Sizes, Composition, Sequence Multiple Alignment, SNP Barcode Read Coverage, k-mer Coverage Inserts Happiness, Coverage, CE Statistic Reads Clear Range, Quality Values, Chromatograms Features Arbitrary regions of interest Including Mis-assembly Signatures!!! Overview Zoom & Filter Details on Demand
26
Launch Pad
27
Histograms & Statistics
Insert Size Read Length GC Content Overall Statistics Bird’s eye view of data and assembly quality
28
Scaffold View Statistical Plots Scaffold Features Inserts Overview
Control Panel Details
29
Insert Happiness Happy Stretched Compressed Both mates present
Oriented Correctly && |Insert Size – Library.mean| <= Happy-Distance * Library.sd Stretched Insert Size > Library.mean + Happy-Distance * Library.sd Compressed Insert Size < Library.mean - Happy-Distance * Library.sd Misoriented Same or Outies Linking Read’s mate is in some other scaffold Singleton Read’s mate is a singleton Unmated No mate was provided for read Both mates present Only 1 read present
30
Standard Feature Types
[B] Breakpoint Alignment ends at this position [C] Coverage Location of unusual mate coverage (asmQC) [S] SNPs Location of Correlated SNPs [U] Unitig Used to report location of surrogate unitigs in CA assemblies [X] Other All other Features Loading Features: $ loadFeatures bankname featfile Featfile format: Contigid type end5 end3 comment
31
Contig View Consensus & Position Scrollable Read Tiling
Read Orientation Discrepancy Highlight Summary Navigation Contig Quick Select Regular Expression Consensus Search
32
Contig View Expanded Quality Values Normalized Chromatogram
Chromatograms are loaded from specified directories, or on demand from Trace Archive.
33
Misassembly Walkthough:
Assembly Reports Misassembly Walkthough: Correlated SNPs Contigs Features Reads Scaffolds Full Integration: “Double click takes you there”
34
SNP View Zoom Out SNP Sorted Reads Polymorphism View
35
SNP Barcode SNP Sorted Reads
Colored Rectangle indicate the positions and composition of the SNPs
36
Scaffold View Coverage CE Statistic SNP Feature Happy Stretched
Compressed Misoriented Linking
37
Collapsed Repeat Read Coverage Spike -5.5 CE Dip Compressed Mates
Cluster 68 Correlated SNPs
38
Confirmed Misassembly
Fixed Collapsed repeat Compressed mates (-5.5 CE Stat) Correlated SNPs (68 Positions within 1400bp) Spike in Read Coverage
39
Fixing the collapsed repeat
Select reads and mates in region of collapse. AMOS: findMissingMates, select-reads Reassemble those reads with a stricter unitigger error rate. AMOS: minimus Patch the collapsed region of the original assembly with corrected version. AMOS: stitchContigs
40
stitchContigs Before After
The multiple alignment of the patch replaces the multiple alignment between the two stitch reads. Otherwise, reads left of the left stitch read and right of the right stitch read are taken directly from the original master contig in their original multiple alignment, so the master contig and the stitched contig are identical except near the compression point.
41
More Information AMOS Webpages: Contact AMOS Acknowledgements
Contact AMOS amos-help [ at ] lists.sourceforge.net Acknowledgements Adam Phillippy Ben Shneiderman Steven Salzberg Art Delcher Mihai Pop
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.