Presentation is loading. Please wait.

Presentation is loading. Please wait.

AMOS Assembly Validation and Visualization

Similar presentations


Presentation on theme: "AMOS Assembly Validation and Visualization"— Presentation transcript:

1 AMOS Assembly Validation and Visualization
Michael Schatz October 7, 2008

2 Lander-Waterman Statistics
Sequencing as a random process E(#contigs) = Ne-cσ E(contig size) = L[((ecσ – 1)/c) - (1- σ)] L = read length T = minimum overlap G = genome size N = number of reads c = coverage (NL / G) σ = 1 – T/L

3 Assembly Reality Contigs are never as large as expected
High coverage is a necessary but not sufficient condition Sequencing is basically random, but sequence composition is not Repeats control the quality of the assembly Assemblers break contigs at ambiguous repeats Highly repetitive genomes will be highly fragmented Assemblers make mistakes Mis-assemblies confuse all downstream analysis Tension between overlap error rate and repeat resolution

4 Assembly Evaluation

5 Assembly Evaluation

6 Genome Assembly Forensics
Computationally scan an assembly for mis-assemblies. Data inconsistencies are indicators for mis-assembly Some inconsistencies are merely statistical variations amosvalidate Load Assembly Data into Bank Analyze Mate Pairs & Libraries Analyze Depth of Coverage Analyze Normalized K-mers Analyze Read Alignments Analyze Read Breakpoints Load Mis-assembly Signatures into Bank AMOS Bank

7 Mis-assembly types Compression A R B Expansion A R B Correct
Rearrangement A R B C D Correct Mis-assembly Inversion A R C B Basic mis-assemblies can be combined into more complicated patterns: Insertions, Deletions, Giant Hairballs

8 1. Analyze Mate Pairs Evaluate mate “happiness” across assembly
Happy = Correct orientation and distance Finds regions with multiple: Compressed Mates Expanded Mates Invalid orientation (same   or outie  ) Missing Mates Linking mates (mate in a different scaffold) Singleton mates (mate is not in any contig)

9 Mate-Happiness: Deletion
Deletion: Excise reads between collapsed repeats Truth Misassembly: Compressed Mates, Linking Mates

10 Mate-Happiness: Insertion
Insertion: Additional reads between flanking repeats Truth Misassembly: Expanded Mates, Missing Mates

11 Mate-Happiness: Inversion
Inversion: Flipping of reads Truth Misassembly: Misoriented Mates

12 Mate-Happiness: Rearrangement
Rearrangement: Reordering of reads Truth Misassembly: Misoriented, Compressed, Stretched Mates

13 C/E Statistic Easy to detect mis-oriented or missing mates, but individual compressed or expanded mates are expected. Does the distribution of inserts spanning a given position differ from the rest of the library? Record large differences as potential misassemblies Even if each individual mate is “happy” Compute the statistic at all positions (Local Mean – Global Mean) / (Global Stdev / √N) > +3 indicates significant expansion < -3 indicates significant compression Introduced by Jim Yorke’s group at UMD

14 Sampling the Genome 2kb 4kb 6kb 8 inserts: 3kb-6kb Local Mean: 4048
C/E Stat: ( ) = +0.33 (400 / √8) Near 0 indicates overall happiness 0kb

15 C/E-Statistic: Expansion
0kb 2kb 4kb 6kb 8 inserts: 3.2kb-6kb Local Mean: 4461 C/E Stat: ( ) = +3.26 (400 / √8) C/E Stat ≥ 3.0 indicates Expansion

16 C/E-Statistic: Compression
0kb 2kb 4kb 6kb 8 inserts: 3.2 kb-4.8kb Local Mean: 3488 C/E Stat: ( ) = -3.62 (400 / √8) C/E Stat ≤ -3.0 indicates Compression

17 2. Read Coverage Find regions of contigs where the depth of coverage is unusually high Collapsed Repeat Signature Can detect collapse of 100% identical repeats AMOS Tool: analyzeReadDepth 2.5x mean coverage A R1 B R2 A R1 + R2 B

18 3. Normalized K-mers Not all repeats are mis-assembled, but most mis-assemblies occur at repeats. Detect repeats by noticing increased read k-mer coverage. 2 copy repeat will have ~2x the average read k-mer coverage N copy repeat will have ~Nx the average read k-mer coverage Normalize read k-mer coverage with consensus k-mer coverage to highlight just the repeats with “missing” copies. The sequence of an N copy repeat should occur N times in the consensus.

19 Normalized K-mers Correct assembly of 2 copy repeat A R1 B R2
Read K-mers Consensus K-mers Normalized K-mers

20 Normalized K-mers Mis-assembly of 2 copy repeat
Effective even if reads are left as singletons A B R1 + R2 Read K-mers Consensus K-mers Normalized K-mers

21 4. Read Alignment Multiple reads with same conflicting base are unlikely 1x QV 30: 1/1000 base calling error 2x QV 30: 1/1,000,000 base calling error 3x QV 30: 1/1,000,000,000 base calling error Regions of correlated SNPs are likely to be assembly errors or interesting biological events Highly specific metric AMOS Tools: analyzeSNPs & clusterSNPs Locate regions with high rate of correlated SNPs Parameterized thresholds: Multiple positions within 100bp sliding window 2+ conflicting reads Cumulative QV >= 40 (1/10000 base calling error) A G C A G C A G C A G C A G C A G C C T A C T A C T A C T A C T A

22 5. Read Breakpoints Align singleton reads to consensus sequences. A consistent breakpoint shared by multiple reads can indicate a collapsed repeat. Initially developed to detect collapsed repeat in Bacillus Anthracis. 375 BAPDN53TF 786bp 665 BAPDF83TF bp 428 BAPCM37TR 697bp 668 BAPBW17TR 1049bp 144337 16S rRNA 146944 144203 145515 146226 147021 RA RB

23 Performance Combining signatures into suspicious regions greatly improves specificity.

24 Hawkeye

25 Hawkeye Goals Interactively explore and analyze Libraries
Insert Sizes, Read Length, Inserts Scaffolds & Contigs Sizes, Composition, Sequence Multiple Alignment, SNP Barcode Read Coverage, k-mer Coverage Inserts Happiness, Coverage, CE Statistic Reads Clear Range, Quality Values, Chromatograms Features Arbitrary regions of interest Including Mis-assembly Signatures!!! Overview Zoom & Filter Details on Demand

26 Launch Pad

27 Histograms & Statistics
Insert Size Read Length GC Content Overall Statistics Bird’s eye view of data and assembly quality

28 Scaffold View Statistical Plots Scaffold Features Inserts Overview
Control Panel Details

29 Insert Happiness Happy Stretched Compressed Both mates present
Oriented Correctly && |Insert Size – Library.mean| <= Happy-Distance * Library.sd Stretched Insert Size > Library.mean + Happy-Distance * Library.sd Compressed Insert Size < Library.mean - Happy-Distance * Library.sd Misoriented Same or Outies Linking Read’s mate is in some other scaffold Singleton Read’s mate is a singleton Unmated No mate was provided for read Both mates present Only 1 read present

30 Standard Feature Types
[B] Breakpoint Alignment ends at this position [C] Coverage Location of unusual mate coverage (asmQC) [S] SNPs Location of Correlated SNPs [U] Unitig Used to report location of surrogate unitigs in CA assemblies [X] Other All other Features Loading Features: $ loadFeatures bankname featfile Featfile format: Contigid type end5 end3 comment

31 Contig View Consensus & Position Scrollable Read Tiling
Read Orientation Discrepancy Highlight Summary Navigation Contig Quick Select Regular Expression Consensus Search

32 Contig View Expanded Quality Values Normalized Chromatogram
Chromatograms are loaded from specified directories, or on demand from Trace Archive.

33 Misassembly Walkthough:
Assembly Reports Misassembly Walkthough: Correlated SNPs Contigs Features Reads Scaffolds Full Integration: “Double click takes you there”

34 SNP View Zoom Out SNP Sorted Reads Polymorphism View

35 SNP Barcode SNP Sorted Reads
Colored Rectangle indicate the positions and composition of the SNPs

36 Scaffold View Coverage CE Statistic SNP Feature Happy Stretched
Compressed Misoriented Linking

37 Collapsed Repeat Read Coverage Spike -5.5 CE Dip Compressed Mates
Cluster 68 Correlated SNPs

38 Confirmed Misassembly
Fixed Collapsed repeat Compressed mates (-5.5 CE Stat) Correlated SNPs (68 Positions within 1400bp) Spike in Read Coverage

39 Fixing the collapsed repeat
Select reads and mates in region of collapse. AMOS: findMissingMates, select-reads Reassemble those reads with a stricter unitigger error rate. AMOS: minimus Patch the collapsed region of the original assembly with corrected version. AMOS: stitchContigs

40 stitchContigs Before After
The multiple alignment of the patch replaces the multiple alignment between the two stitch reads. Otherwise, reads left of the left stitch read and right of the right stitch read are taken directly from the original master contig in their original multiple alignment, so the master contig and the stitched contig are identical except near the compression point.

41 More Information AMOS Webpages: Contact AMOS Acknowledgements
Contact AMOS amos-help [ at ] lists.sourceforge.net Acknowledgements Adam Phillippy Ben Shneiderman Steven Salzberg Art Delcher Mihai Pop


Download ppt "AMOS Assembly Validation and Visualization"

Similar presentations


Ads by Google