AMOS Assembly Validation and Visualization

Slides:



Advertisements
Similar presentations
Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.
Advertisements

Huong Le Department of Molecular & Clinical Genetics, Royal Prince Alfred Hospital Click mouse to move to the next slide.
Visualising and Exploring BS-Seq Data
Genome Assembly: a brief introduction
Click to edit Master title style Irys data analysis January 10 th, 2014.
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
1000 Genomes SV detection Boston College Chip Stewart 24 November 2008.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
General methods of SNP discovery: PolyBayes Gabor T. Marth Department of Biology Boston College Chestnut Hill, MA
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Genome sequence assembly
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
Genome sequencing and assembling
Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.
The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Genome Assembly Bonnie Hurwitz Graduate student TMPL.
NGS Analysis Using Galaxy
How to Build a Horse Megan Smedinghoff.
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp , 2004.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Fig Chapter 12: Genomics. Genomics: the study of whole-genome structure, organization, and function Structural genomics: the physical genome; whole.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
P. Tang ( 鄧致剛 ); RRC. Gan ( 甘瑞麒 ); PJ Huang ( 黄栢榕 ) Bioinformatics Center, Chang Gung University. Genome Sequencing Genome Resequencing De novo Genome.
AMOS tools for assembly validation Automatically scan an assembly to locate misassembly signatures for further analysis and correction  Load Assembly.
Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park 1.
Genomics Method Seminar - BreakDancer January 21, 2015 Sora Kim Researcher Yonsei Biomedical Science Institute Yonsei University College.
AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D.
FIX Eye FIX Eye Getting started: The guide EPAM Systems B2BITS.
Identification of Copy Number Variants using Genome Graphs
Human Genome.
De novo assembly validation
Mojavensis: Issues of Polymorphisms Chris Shaffer GEP 2009 Washington University.
Maik Friedel, Thomas Wilhelm, Jürgen Sühnel FLI-Jena, Germany Introduction: During the last 10 years, a large number of complete.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Review for MassHunter and reporting
1. Assembly by alignment Instead of overlap-layout-consensus we use alignment-consensus 2.
Chapter 5 Sequence Assembly: Assembling the Human Genome.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
Phusion2 Assemblies and Indel Confirmation Zemin Ning The Wellcome Trust Sanger Institute.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
071126_EAS56_0057_FC – lanes 1-8 read 2 b a _EAS56_0057_FC – lanes 1-8 read 1 Table S1. Summary tables for a read 1 and b read 2 of a.
Virginia Commonwealth University
Canadian Bioinformatics Workshops
Introduction to Bioinformatics Resources for DNA Barcoding
Lesson: Sequence processing
Amos Tanay Nir Yosef 1st HCA Jamboree, 8/2017
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite.
Jeong-Hyeon Choi, Sun Kim, Haixu Tang, Justen Andrews, Don G. Gilbert
Genome sequence assembly
Ssaha_pileup - a SNP/indel detection pipeline from new sequencing data
Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome.
Frequency of Nonallelic Homologous Recombination Is Correlated with Length of Homology: Evidence that Ectopic Synapsis Precedes Ectopic Crossing-Over 
Henrik Lantz - NBIS/SciLife/Uppsala University
Jin Zhang, Jiayin Wang and Yufeng Wu
MapView: visualization of short reads alignment on a desktop computer
How to Build a Horse: Final Report
Discovery tools for human genetic variations
Exploring and Understanding ChIP-Seq data
Visualising and Exploring BS-Seq Data
Volume 10, Issue 5, Pages (May 2018)
Presentation transcript:

AMOS Assembly Validation and Visualization Michael Schatz October 7, 2008

Lander-Waterman Statistics Sequencing as a random process E(#contigs) = Ne-cσ E(contig size) = L[((ecσ – 1)/c) - (1- σ)] L = read length T = minimum overlap G = genome size N = number of reads c = coverage (NL / G) σ = 1 – T/L

Assembly Reality Contigs are never as large as expected High coverage is a necessary but not sufficient condition Sequencing is basically random, but sequence composition is not Repeats control the quality of the assembly Assemblers break contigs at ambiguous repeats Highly repetitive genomes will be highly fragmented Assemblers make mistakes Mis-assemblies confuse all downstream analysis Tension between overlap error rate and repeat resolution

Assembly Evaluation

Assembly Evaluation

Genome Assembly Forensics Computationally scan an assembly for mis-assemblies. Data inconsistencies are indicators for mis-assembly Some inconsistencies are merely statistical variations amosvalidate Load Assembly Data into Bank Analyze Mate Pairs & Libraries Analyze Depth of Coverage Analyze Normalized K-mers Analyze Read Alignments Analyze Read Breakpoints Load Mis-assembly Signatures into Bank AMOS Bank

Mis-assembly types Compression A R B Expansion A R B Correct Rearrangement A R B C D Correct Mis-assembly Inversion A R C B Basic mis-assemblies can be combined into more complicated patterns: Insertions, Deletions, Giant Hairballs

1. Analyze Mate Pairs Evaluate mate “happiness” across assembly Happy = Correct orientation and distance Finds regions with multiple: Compressed Mates Expanded Mates Invalid orientation (same   or outie  ) Missing Mates Linking mates (mate in a different scaffold) Singleton mates (mate is not in any contig)

Mate-Happiness: Deletion Deletion: Excise reads between collapsed repeats Truth Misassembly: Compressed Mates, Linking Mates

Mate-Happiness: Insertion Insertion: Additional reads between flanking repeats Truth Misassembly: Expanded Mates, Missing Mates

Mate-Happiness: Inversion Inversion: Flipping of reads Truth Misassembly: Misoriented Mates

Mate-Happiness: Rearrangement Rearrangement: Reordering of reads Truth Misassembly: Misoriented, Compressed, Stretched Mates

C/E Statistic Easy to detect mis-oriented or missing mates, but individual compressed or expanded mates are expected. Does the distribution of inserts spanning a given position differ from the rest of the library? Record large differences as potential misassemblies Even if each individual mate is “happy” Compute the statistic at all positions (Local Mean – Global Mean) / (Global Stdev / √N) > +3 indicates significant expansion < -3 indicates significant compression Introduced by Jim Yorke’s group at UMD

Sampling the Genome 2kb 4kb 6kb 8 inserts: 3kb-6kb Local Mean: 4048 C/E Stat: (4048-4000) = +0.33 (400 / √8) Near 0 indicates overall happiness 0kb

C/E-Statistic: Expansion 0kb 2kb 4kb 6kb 8 inserts: 3.2kb-6kb Local Mean: 4461 C/E Stat: (4461-4000) = +3.26 (400 / √8) C/E Stat ≥ 3.0 indicates Expansion

C/E-Statistic: Compression 0kb 2kb 4kb 6kb 8 inserts: 3.2 kb-4.8kb Local Mean: 3488 C/E Stat: (3488-4000) = -3.62 (400 / √8) C/E Stat ≤ -3.0 indicates Compression

2. Read Coverage Find regions of contigs where the depth of coverage is unusually high Collapsed Repeat Signature Can detect collapse of 100% identical repeats AMOS Tool: analyzeReadDepth 2.5x mean coverage A R1 B R2 A R1 + R2 B

3. Normalized K-mers Not all repeats are mis-assembled, but most mis-assemblies occur at repeats. Detect repeats by noticing increased read k-mer coverage. 2 copy repeat will have ~2x the average read k-mer coverage N copy repeat will have ~Nx the average read k-mer coverage Normalize read k-mer coverage with consensus k-mer coverage to highlight just the repeats with “missing” copies. The sequence of an N copy repeat should occur N times in the consensus.

Normalized K-mers Correct assembly of 2 copy repeat A R1 B R2 Read K-mers Consensus K-mers Normalized K-mers

Normalized K-mers Mis-assembly of 2 copy repeat Effective even if reads are left as singletons A B R1 + R2 Read K-mers Consensus K-mers Normalized K-mers

4. Read Alignment Multiple reads with same conflicting base are unlikely 1x QV 30: 1/1000 base calling error 2x QV 30: 1/1,000,000 base calling error 3x QV 30: 1/1,000,000,000 base calling error Regions of correlated SNPs are likely to be assembly errors or interesting biological events Highly specific metric AMOS Tools: analyzeSNPs & clusterSNPs Locate regions with high rate of correlated SNPs Parameterized thresholds: Multiple positions within 100bp sliding window 2+ conflicting reads Cumulative QV >= 40 (1/10000 base calling error) A G C A G C A G C A G C A G C A G C C T A C T A C T A C T A C T A

5. Read Breakpoints Align singleton reads to consensus sequences. A consistent breakpoint shared by multiple reads can indicate a collapsed repeat. Initially developed to detect collapsed repeat in Bacillus Anthracis. 375 BAPDN53TF 786bp 665 BAPDF83TF 786bp 428 BAPCM37TR 697bp 668 BAPBW17TR 1049bp 144337 16S rRNA 146944 144203 145515 146226 147021 RA RB

Performance Combining signatures into suspicious regions greatly improves specificity.

Hawkeye

Hawkeye Goals Interactively explore and analyze Libraries Insert Sizes, Read Length, Inserts Scaffolds & Contigs Sizes, Composition, Sequence Multiple Alignment, SNP Barcode Read Coverage, k-mer Coverage Inserts Happiness, Coverage, CE Statistic Reads Clear Range, Quality Values, Chromatograms Features Arbitrary regions of interest Including Mis-assembly Signatures!!! Overview Zoom & Filter Details on Demand

Launch Pad

Histograms & Statistics Insert Size Read Length GC Content Overall Statistics Bird’s eye view of data and assembly quality

Scaffold View Statistical Plots Scaffold Features Inserts Overview Control Panel Details

Insert Happiness Happy Stretched Compressed Both mates present Oriented Correctly && |Insert Size – Library.mean| <= Happy-Distance * Library.sd Stretched Insert Size > Library.mean + Happy-Distance * Library.sd Compressed Insert Size < Library.mean - Happy-Distance * Library.sd Misoriented Same or Outies Linking Read’s mate is in some other scaffold Singleton Read’s mate is a singleton Unmated No mate was provided for read Both mates present Only 1 read present

Standard Feature Types [B] Breakpoint Alignment ends at this position [C] Coverage Location of unusual mate coverage (asmQC) [S] SNPs Location of Correlated SNPs [U] Unitig Used to report location of surrogate unitigs in CA assemblies [X] Other All other Features Loading Features: $ loadFeatures bankname featfile Featfile format: Contigid type end5 end3 comment

Contig View Consensus & Position Scrollable Read Tiling Read Orientation Discrepancy Highlight Summary Navigation Contig Quick Select Regular Expression Consensus Search

Contig View Expanded Quality Values Normalized Chromatogram Chromatograms are loaded from specified directories, or on demand from Trace Archive.

Misassembly Walkthough: Assembly Reports Misassembly Walkthough: Correlated SNPs Contigs Features Reads Scaffolds Full Integration: “Double click takes you there”

SNP View Zoom Out SNP Sorted Reads Polymorphism View

SNP Barcode SNP Sorted Reads Colored Rectangle indicate the positions and composition of the SNPs

Scaffold View Coverage CE Statistic SNP Feature Happy Stretched Compressed Misoriented Linking

Collapsed Repeat Read Coverage Spike -5.5 CE Dip Compressed Mates Cluster 68 Correlated SNPs

Confirmed Misassembly Fixed Collapsed repeat Compressed mates (-5.5 CE Stat) Correlated SNPs (68 Positions within 1400bp) Spike in Read Coverage

Fixing the collapsed repeat Select reads and mates in region of collapse. AMOS: findMissingMates, select-reads Reassemble those reads with a stricter unitigger error rate. AMOS: minimus Patch the collapsed region of the original assembly with corrected version. AMOS: stitchContigs

stitchContigs Before After The multiple alignment of the patch replaces the multiple alignment between the two stitch reads. Otherwise, reads left of the left stitch read and right of the right stitch read are taken directly from the original master contig in their original multiple alignment, so the master contig and the stitched contig are identical except near the compression point.

More Information AMOS Webpages: Contact AMOS Acknowledgements http://amos.sourceforge.net/forensics http://amos.sourceforge.net/hawkeye Contact AMOS amos-help [ at ] lists.sourceforge.net Acknowledgements Adam Phillippy Ben Shneiderman Steven Salzberg Art Delcher Mihai Pop