Genome Research 12:1 (2002), 177-189. Assembly algorithm outline ● Input and trimming ● Overlap detection ● Error correction ● Evaluation of alignments.

Presentation transcript:

Genome Research 12:1 (2002),

Assembly algorithm outline
● Input and trimming
● Overlap detection
● Error correction
● Evaluation of alignments
● Identification of paired pairs
● Contig assembly
● Identification of repeat contigs
● Creation of scaffolds
● Filling gaps in scaffolds
● Consensus computation

Trimming
● Find the longest contiguous sequence with error less than 5% (use quality values)
● Trim further if any base with Q<10 is within 12 bases of either end
● Throw away the read if length < 50 after trimming
● Identify vector by aligning with E. coli and known cloning vector sequences
● Remove vector from the beginning and/or end of the read
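The end-trimming and minimum-length rules above can be sketched in a few lines. This is a minimal sketch, not ARACHNE's code: the function name `trim_read` and its parameters are hypothetical, "within 12 bases of either end" is read as a 12-base window, and the first rule (longest stretch with error below 5%) and vector removal are left out.

```python
def trim_read(quals, min_q=10, window=12, min_len=50):
    """Return (start, end) of the kept region, or None if the read is discarded.

    quals: list of per-base quality values for one read.
    An end is trimmed while any base with Q < min_q lies within `window`
    bases of it; the read is dropped if fewer than min_len bases remain.
    """
    start, end = 0, len(quals)
    changed = True
    while changed and end - start >= min_len:
        changed = False
        # Left end: cut just past the rightmost low-quality base in the window.
        for i in range(min(start + window, end) - 1, start - 1, -1):
            if quals[i] < min_q:
                start = i + 1
                changed = True
                break
        # Right end: cut just before the leftmost low-quality base in the window.
        for i in range(max(end - window, start), end):
            if quals[i] < min_q:
                end = i
                changed = True
                break
    if end - start < min_len:
        return None
    return start, end
```

The loop repeats because each cut exposes a new 12-base window that may itself contain a low-quality base.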

Overlapping
● 24-mer indexing
● Index only 1/2 of all k-mers
  – for each pair (x1,x2) where x1 is the reverse complement of x2, store whichever k-mer is alphabetically first
● Exclude high-copy k-mers
● Create read pairs for all reads that share one or more k-mers
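The indexing scheme above can be illustrated with a small in-memory sketch. The names `canonical` and `candidate_pairs` and the `max_copies` cutoff are assumptions made for illustration; the slide gives no threshold for "high-copy", and this sketch counts distinct reads per k-mer rather than total occurrences.

```python
from collections import defaultdict
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the alphabetically first of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def candidate_pairs(reads, k=24, max_copies=100):
    """Return the set of read-index pairs (i, j), i < j, sharing at least one k-mer."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            # Storing only the canonical form halves the number of distinct keys.
            index[canonical(seq[i:i + k])].add(read_id)
    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_copies:
            continue  # hypothetical cutoff for high-copy (likely repeat) k-mers
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs
```

Because both strands map to the same key, a read and the reverse complement of an overlapping read end up as a candidate pair without indexing every read twice.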

Error correction in reads
● Correct errors using multiple alignment
  TAGATTACACAGATTACTGA
  TAG-TTACACAGATTATTGA
  TAGATTACACAGATTACTGA
  – Mismatch column (base: quality): C: 20, C: 35, T: 30, C: 35, C: 40 → C: 20, C: 35, C: 0, C: 35, C: 40
  – Gap column (base: quality): A: 15, A: 25, A: 40, A: 25, - → A: 15, A: 25, A: 40, A: 25, A: 0
● Score alignments
● Accept alignments with good scores
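The quality columns on this slide suggest that a disagreeing base (or a gap) is overwritten with the column's dominant base and given quality 0. The slide does not state when a correction is made, so the sketch below assumes a simple rule: correct when the summed quality of the dominant base exceeds that of the disagreeing base by a margin. The function name `correct_column` and the `min_margin` value are hypothetical.

```python
from collections import defaultdict

def correct_column(column, min_margin=60):
    """Correct one alignment column.

    column: list of (base, quality) tuples, one per read; '-' marks a gap.
    Returns a new list in which weakly supported bases are replaced by the
    dominant base with quality 0.
    """
    support = defaultdict(int)
    for base, quality in column:
        support[base] += quality
    best = max(support, key=support.get)
    corrected = []
    for base, quality in column:
        if base != best and support[best] - support[base] >= min_margin:
            corrected.append((best, 0))  # corrected base carries no confidence
        else:
            corrected.append((base, quality))
    return corrected
```

Assigning quality 0 keeps a corrected base from adding support of its own in later voting steps.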

Evaluation of alignments (pairs)
● Penalty (P) for each mismatching base is the minimum of:
  – quality scores of the two aligned bases
  – quality scores of the bases on their immediate left and right
● Penalty score is then $10^{P/10}$
● Discard pairs with penalty score > 100

Evaluation of alignments (pairs)
● Example:
  AAGTGTCTA
  AAGTGCCTA
● P = min(10, 30) = 10 because of the T-C mismatch
● Penalty score is $10^{P/10} = 10^{1} = 10$
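The penalty rule can be written out as a short sketch. It assumes the slide's expression reads as 10 raised to the power P/10, an ungapped alignment of two equal-length reads, and that penalties of individual mismatches are summed; none of these is confirmed by the slides. The function names are hypothetical.

```python
def mismatch_quality(quals_a, quals_b, pos):
    """P: minimum quality over the two aligned bases and their immediate neighbours."""
    candidates = []
    for quals in (quals_a, quals_b):
        for i in (pos - 1, pos, pos + 1):
            if 0 <= i < len(quals):
                candidates.append(quals[i])
    return min(candidates)

def alignment_penalty(seq_a, quals_a, seq_b, quals_b):
    """Sum 10**(P/10) over all mismatching columns (summation is an assumption)."""
    total = 0.0
    for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            total += 10 ** (mismatch_quality(quals_a, quals_b, pos) / 10)
    return total

def keep_pair(seq_a, quals_a, seq_b, quals_b, max_penalty=100):
    """Apply the discard rule: pairs with penalty score > 100 are dropped."""
    return alignment_penalty(seq_a, quals_a, seq_b, quals_b) <= max_penalty
```

Under this reading, a mismatch between low-quality bases costs little, while a single mismatch supported by high-quality bases on both reads is enough to reject the pair.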

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 2 Using paired pairs of overlaps to merge reads

Contig assembly
● Paired pairs form the initial contigs
● Next, mark repeat boundaries before doing further merging
● Only merge read pairs when they do not cross a repeat boundary

Serafim Batzoglou et al. Genome Res. 2002; 12: (A) Merging across a repeat boundary may cause misassembly. Here, A may be assembled next to D. (B) A potential repeat boundary identified by the divergence of reads x and y, both of which overlap r. (C) Contigs on left and right are created by merging reads up to a repeat boundary. The repeat region would also create a contig, whose coverage would be twice as deep. (D) Sequence errors may cause artificial breaks. Read r “dominates” read y because the neighbors of y are all neighbors of r. “Dominated” reads are eliminated.

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 4 Detection of repeat contigs. Contig R is linked to contigs A and B to the right. The distances estimated between R and A and R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B (if their reads do not overlap), then R is probably a repeat linking to two unique regions to the right.

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 5. Scaffold creation and gap filling. [Figure annotation: these are usually repetitive contigs]

Simulation of WGS Data
● Reads selected from a target genome in random locations
● Errors created using realistic quality values and errors from real reads taken from finished BACs done at MIT/Whitehead
● No cloning bias (no large gaps in coverage)
● No long stretches of low quality within reads
● Two data sets created, at 10.3X and 5.15X
● Two libraries, 4kb and 40kb, with 20:1 ratio

Making a Simulated Read
Simulated reads have error patterns taken from random real reads.
[Figure labels: artificial shotgun read, real read, ERRORIZER, simulated read]

Human 22, Results of Simulations
Plasmid/Cosmid coverage: 10X / 0.5X, 5X / 0.5X, 3X / 0X
N50 contig: 353 Kb, 15 Kb, 2.7 Kb
Mean contig: 142 Kb, 10.6 Kb, 2.0 Kb
N50 scaffold: 3 Mb, 4.1 Kb
Avg base qual:
% > 2 kb:

Neurospora crassa Genome (Real Data)
40 Mb genome, shotgun sequencing complete (Whitehead Genome Ctr)
Coverage:
● 1705 contigs
● 368 scaffolds
● 1% uncovered (of finished BACs)
Evaluated assembly using 1.5 Mb of finished BACs
Efficiency:
● Time: 20 hr
● Memory: 9 Gb
Accuracy:
● < 3 misassemblies compared with 1 Gb of finished sequence
● Errors per 10^6 letters: substitutions 260, indels 164

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 6 Types of misassemblies

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 8 Merging k-mer hits in the alignment module

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 9. Detection of chimeric reads. Reads l1, l2, l3, r1, r2, and r3, and the absence of a read n (having long overlaps on both sides of a point x) suggest that read c may be chimeric, consisting of the juxtaposition of two disparate genomic segments: one corresponding to the part of c before x, and one corresponding to the part of c after x. We call x the point of chimerism of c. Note that reads l3 and r3 extend slightly beyond x, as often happens for real chimeric reads.

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 10. Contig assembly. If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that shift(b,c) = shift(a,c) - shift(a,b). We detect a repeat boundary toward the right of read a if there is no overlap (b,c), nor any path of reads x1, ..., xk such that (b,x1), (x1,x2), ..., (xk,c) are all overlaps and shift(b,x1) + ... + shift(xk,c) ≈ shift(a,c) - shift(a,b).
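The repeat-boundary test in this caption can be sketched as a bounded path search. The sketch rests on several assumptions not taken from the paper: overlaps are stored as a dictionary of strictly positive shifts, "approximately equal" means within a tolerance `tol` of a few bases, and a direct overlap (b,c) only counts if its shift is also consistent. The function name is hypothetical.

```python
def repeat_boundary_right_of(a, b, c, shift, tol=5):
    """Return True if no overlap or path of overlaps from b to c explains
    the expected shift(a,c) - shift(a,b).

    shift: dict mapping (x, y) -> positive offset of read y relative to read x,
           one entry per detected overlap (assumed representation).
    """
    expected = shift[(a, c)] - shift[(a, b)]
    neighbours = {}
    for (x, y), s in shift.items():
        neighbours.setdefault(x, []).append((y, s))

    stack = [(b, 0)]
    seen = {(b, 0)}
    while stack:
        read, total = stack.pop()
        for nxt, s in neighbours.get(read, []):
            new_total = total + s
            if new_total > expected + tol:
                continue  # path already overshoots the expected shift
            if nxt == c and abs(new_total - expected) <= tol:
                return False  # a consistent overlap or path exists
            if (nxt, new_total) not in seen:
                seen.add((nxt, new_total))
                stack.append((nxt, new_total))
    return True
```

Pruning on the accumulated shift keeps the search finite as long as every shift is positive.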

Subreads
● After contigs are created, subreads are inserted
  – Subreads can be completely contained in other reads
  – Subreads can also be completely contained within a contig but not within any one read
● Subreads are only inserted if this operation is unambiguous
● Subread insertion improves scaffolding in subsequent steps
  – because it adds new mate-pair links

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 11 Consistency of forward-reverse links

Pairing contigs for scaffolding
Two scaffolds S1, S2 have distance d(S1,S2), based on the estimated base-pair distance between them (contigs are “singleton scaffolds”).
Priority score is: s(S1,S2) = f(k) - |d(S1,S2)|
f(k) is a heuristic ‘reward’ for the number of links k between S1 and S2.
f(2), f(3), f(4), ... = 50, 875, 1700, 2025, 2350, 2475, 2600, 2625, 2650, 2675
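The priority score is simple enough to state directly in code. In this sketch the reward table is copied from the slide, starting at f(2); clamping to the last listed value for larger k is an assumption, since the slide does not say how the sequence continues. The names `reward` and `priority` are hypothetical.

```python
# f(2), f(3), f(4), ... as listed on the slide.
REWARDS = [50, 875, 1700, 2025, 2350, 2475, 2600, 2625, 2650, 2675]

def reward(k):
    """f(k): heuristic reward for k forward-reverse links (k >= 2)."""
    if k < 2:
        raise ValueError("at least 2 links are required")
    # Assumption: beyond the listed values, reuse the last one.
    return REWARDS[min(k, len(REWARDS) + 1) - 2]

def priority(k, distance):
    """s(S1,S2) = f(k) - |d(S1,S2)|, with distance in base pairs."""
    return reward(k) - abs(distance)
```

For example, `priority(3, 100)` gives 875 - 100 = 775, so a well-linked pair that sits close together is merged before a distant or weakly linked one.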

Scaffold assembly
1. Create a priority queue Q with all pairs of contigs that are
   1. not repetitive
   2. linked by at least 2 forward-reverse pairs
2. Loop: merge the highest-priority pair (S1,S2), creating a new scaffold T
3. Remove all pairs in Q containing S1 or S2
4. Create new pairs (T,W) for all scaffolds W that share forward-reverse links with T
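The greedy loop above maps naturally onto a heap with lazy deletion. The sketch below is illustrative only: scaffolds are modelled as unordered sets of contig ids, so ordering, orientation and gap distances are ignored, the priority is any caller-supplied function of the link count, and instead of removing stale pairs from Q they are skipped when popped. All names are hypothetical.

```python
import heapq
import itertools

def greedy_scaffolds(contigs, links, priority, min_links=2):
    """Greedily merge scaffolds by priority.

    contigs:  iterable of non-repetitive contig ids.
    links:    dict mapping frozenset({c1, c2}) -> number of forward-reverse links.
    priority: function taking a link count and returning a score.
    Returns a set of frozensets, one per final scaffold.
    """
    scaffolds = [frozenset([c]) for c in contigs]
    alive = set(scaffolds)
    heap = []
    tiebreak = itertools.count()  # keeps heap entries comparable on equal scores

    def n_links(s, t):
        return sum(links.get(frozenset((x, y)), 0) for x in s for y in t)

    def push(s, t):
        k = n_links(s, t)
        if k >= min_links:
            heapq.heappush(heap, (-priority(k), next(tiebreak), s, t))

    for s, t in itertools.combinations(scaffolds, 2):
        push(s, t)

    while heap:
        _, _, s, t = heapq.heappop(heap)
        if s not in alive or t not in alive:
            continue  # stale pair: one side was already merged
        alive -= {s, t}
        merged = s | t
        for w in alive:
            push(merged, w)  # new pairs (T, W) for scaffolds linked to T
        alive.add(merged)
    return alive
```

Skipping stale entries on pop has the same effect as step 3 without needing a delete operation, which `heapq` does not provide.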

Scaffold assembly
● The scaffold assembly procedure on the previous slides is first run using only short inserts (less than 10,000 bp)
● The entire procedure is then re-run using all links

Serafim Batzoglou et al. Genome Res. 2002; 12: Figure 12 Filling gaps in scaffolds. (A) Contigs A and B are connected by a path p of contigs X1,..., Xk. The distance dp(A,B) between A and B (along the path p) is the length of the sequence in the path that does not overlap A or B. (B) Contigs Y1 and Y2 share forward-reverse links with the scaffold S. These links position them in the vicinity of the gap between A and B. Therefore, Y1 and Y2 will be used as possible stepping points in the path closing the gap from A to B.

Consensus sequence computation
● All contigs contain reads with approximate positions within those contigs
● Start at the left end of each contig
● Move base-by-base, computing the consensus by a quality-weighted vote
● Switch to another read when:
  – at the end of the current read
  – at a deletion in the current read
  – at a low-quality region in the current read

Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments:
  TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
  TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
  TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
  TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
  TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
  TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Derive each consensus base by weighted voting
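Weighted voting over the columns of such an alignment can be sketched as follows. The sketch assumes the reads are already padded to equal length with '-' for gaps, that every column position (gaps included) carries a weight, and that a column won by the gap symbol contributes nothing to the consensus. It does not model the read-switching rules listed on the previous slide, and the function name is hypothetical.

```python
from collections import defaultdict

def weighted_consensus(rows, quals):
    """Compute a consensus by per-column quality-weighted vote.

    rows:  list of equal-length aligned reads, '-' marking a gap.
    quals: list of per-row weight lists, one weight per column.
    """
    consensus = []
    for col in range(len(rows[0])):
        votes = defaultdict(int)
        for row, row_quals in zip(rows, quals):
            votes[row[col]] += row_quals[col]
        best = max(votes, key=votes.get)
        if best != "-":  # a winning gap means no base at this position
            consensus.append(best)
    return "".join(consensus)
```

With equal weights on the six reads shown on this slide, the vote returns TAGATTACACAGATTACTGACTTGATGGCGTAACTA: the isolated T, the C/G disagreements and the single inserted A are all outvoted.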

ARACHNE 2

Three major improvements
● Scaffold breaking and re-joining introduced
● Gaps can now be filled by individual reads, not just by contigs
  – This is equivalent to the “stones” method in the Celera Assembler
● Memory usage was reduced fourfold

David B. Jaffe et al. Genome Res. 2003; 13: Careful joining of scaffolds: minimize “stretching” of mate-pair links Figure 1. Joining of scaffolds. Three scaffolds (a, b, c) are seen off the end of scaffold s. There are two or more read pair links from s to each of them. Each has an optimal position relative to s, determined by the insert lengths corresponding to the read pairs. However, each insert length has a standard deviation associated to it, and so the positions of a, b, and c relative to s also have standard deviations. Supposing that we allow each of them to slide from their optimal positions by up to 2.5 standard deviations, but that we do not allow overlap between any of the scaffolds, is there more than one possible order for the scaffolds? Among the possible orders, does a always appear first (after s)? If so, we join scaffold s to scaffold a.

David B. Jaffe et al. Genome Res. 2003; 13: Scaffold breaking: look for regions where clone coverage = 1 Figure 2. A disguised instance where sequence join alone holds together a scaffold. A long scaffold (blue) from one part of the genome subsumes a small foreign inset (red) from a completely different part of the genome, held together by a single point of attachment within a contig (bicolor): in fact only a sequence join ties blue to red. This was not recognized in the version of the code which produced the released mouse assembly (Mouse Genome Sequencing Consortium 2002). Resolution: break at the bicolor juncture, move the red sequence to where it links in another scaffold.

David B. Jaffe et al. Genome Res. 2003; 13: Scaffold breaking: look for correlated links from the middle of one scaffold to another Figure 3. Positive breaking of scaffolds. Three correlated links are seen between scaffolds S1 and S2. The spread of the connection between S1 and S2 is, in this case, the lesser of 10 kb and 25 kb, which is 10 kb. Because the positive breaking algorithm as applied to mouse required five links with spread at least 50 kb, this connection would not have been sufficient to break the scaffolds. If it were, the respective scaffolds would have been broken at the exact ends of reads (green bars).

Mouse genome assembly
● Improved version of ARACHNE assembled the mouse genome
● Several heuristics that iteratively:
  – Break scaffolds that are suspicious
  – Rejoin scaffolds
● Size of problem: 32,000,000 reads
● Time: 15 days, 1 processor
● Memory: 28 Gb
● N50 contig size: 16.3 Kb -> 24.8 Kb
● N50 scaffold size: 0.27 Mb -> 16.9 Mb