3. Lecture WS 2003/04 Bioinformatics III
Whole Genome Shotgun Assembly
Two strategies for sequencing:
- clone-by-clone approach
- whole-genome shotgun approach (Celera, Gene Myers)
Shotgun sequencing was introduced by F. Sanger et al. (1977) and has remained the mainstay of genome sequence assembly for nearly 25 years.
ED Green, Nat Rev Genet 2, 573 (2001)
Automatic sequencing
Automated Sequencing
Nearly all automated sequencing is done using the enzymatic dideoxy chain-termination method of Sanger (1977).
Separation of fragments by gel electrophoresis. Readout of fragments labeled with fluorescent dyes.
Computer analysis of gel images:
- lane tracking: identify the lane boundaries in the gel
- lane profiling: sum each of the 4 signals across the lane width to create a profile
- trace processing: deconvolute and smooth the signal estimates and reduce noise
- base calling: translate the processed trace into a sequence of bases
The program Phred is the quasi-standard for the last step (base calling).
Base Calling - Phred
B. Ewing, L. Hillier, M.C. Wendl, P. Green. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8, (1998).
B. Ewing, P. Green. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8, (1998).
The processed traces are displayed as chromatograms of 4 curves of different colors, each curve representing the signal of 1 of the 4 bases.
Base Calling - Phred
Idealized traces would consist of evenly spaced, nonoverlapping peaks. Real traces deviate from this ideal due to imperfections of the sequencing reactions, of gel electrophoresis, and of trace processing. The first 50 or so peaks and the peaks beyond about 500 are particularly noisy.
Quality:
- high: no ambiguities
- medium: some ambiguities
- poor: low confidence
Base Calling Algorithm
1 Locate predicted peaks: find the idealized locations of the base peaks using Fourier methods.
2 Locate observed peaks: scan the 4 trace arrays for concave regions satisfying $2\,v(i) \ge v(i+1) + v(i-1)$, where v(i) is the trace value at point i.
3 Match observed and predicted peaks:
a) find easy matches
b) use dynamic programming to align those peaks not matched in a)
c) match remaining observed peaks that seem to represent genuine bases
4 Find missed peaks
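Step 2 can be illustrated with a minimal sketch. It assumes the comparison in the concavity condition is "greater than or equal" and that a single trace is given as a plain list of numbers; the function name `concave_regions` is hypothetical and this is not Phred's actual code.

```python
def concave_regions(v):
    """Return inclusive (start, end) index pairs of maximal runs of interior
    points i that satisfy 2*v[i] >= v[i+1] + v[i-1]."""
    regions = []
    start = None
    for i in range(1, len(v) - 1):
        if 2 * v[i] >= v[i + 1] + v[i - 1]:
            if start is None:
                start = i          # a new concave run begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:          # run extends to the last interior point
        regions.append((start, len(v) - 2))
    return regions


# Toy trace with two peaks (values are made up for illustration).
trace = [0, 1, 4, 9, 10, 9, 4, 1, 0, 1, 5, 8, 5, 1, 0]
print(concave_regions(trace))  # [(3, 5), (10, 12)]
```

Each concave run is a candidate observed peak. A real base caller additionally has to handle flat baseline stretches, which also satisfy the non-strict inequality.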
Phred quality values
$q = -10 \log_{10}(p)$
where q is the quality value and p is the estimated error probability for a base call.
Examples:
- q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
- q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
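The relation between quality value and error probability can be checked numerically. The following is a small sketch with hypothetical helper names, not part of the Phred program itself.

```python
import math


def phred_quality(p):
    """Quality value q for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_probability(q):
    """Inverse relation: estimated error probability for a quality value q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

Every increase of 10 in q lowers the estimated error probability by a factor of 10.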
Phred
Phred performs several tasks:
a. Reads trace files: compatible with most file formats, i.e. SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases: a "Phred value" based on an error rate estimate calculated for each individual base.
d. Creates output files: base calls and quality values are written to output files.
Whole genome assembly: problem description
The goal is to reconstruct an unknown source sequence (the genome) over the alphabet {A, C, G, T}, given many random short segments of that sequence, the shotgun reads.
A read is a subsequence of about 500 nucleotides, taken from a random place in the genome. The orientation of the read is either forward or reverse complement.
Reads contain two kinds of errors: base substitutions and indels. Base substitutions occur with a frequency of ca. 0.5 – 2%. Indels occur roughly 10 times less frequently.
Reads can come from short plasmid inserts (2-12 kb), cosmids (40 kb) or BACs (150 kb).
Batzoglou PhD thesis (2002)
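To make the read model concrete, here is a minimal simulation sketch. The function names and default parameters are assumptions for illustration. Only substitution errors are modeled; indels are left out to keep the example short.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_read(genome, rng, read_len=500, sub_rate=0.01):
    """Draw one shotgun read: random start, random orientation,
    independent substitution errors at rate sub_rate (no indels)."""
    start = rng.randrange(len(genome) - read_len + 1)
    read = genome[start:start + read_len]
    if rng.random() < 0.5:                 # reverse-complement orientation
        read = reverse_complement(read)
    bases = []
    for b in read:
        if rng.random() < sub_rate:        # substitute with a different base
            b = rng.choice([x for x in "ACGT" if x != b])
        bases.append(b)
    return "".join(bases)


rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = [sample_read(genome, rng) for _ in range(50)]
print(len(reads), len(reads[0]))
```

With 50 reads of length 500 from a 5,000-base genome, the expected coverage in this toy setup is about 5-fold.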
Whole Genome Assemblers
- TIGR Assembler: G.G. Sutton et al., Genome Sci Technol 1, 9-19 (1995)
- PHRAP: P. Green (1996)
- Celera Assembler
- CAP3: X. Huang, A. Madan, Genome Res 9, (1999)
- RePS: J. Wang et al., Genome Res 12, (2002)
- Phusion (Sanger): J.C. Mullikin, Z. Ning, Genome Res 13, (2003)
- Arachne (Whitehead/MIT)
- Euler (UCSD, USC): P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001)
Most assemblers follow the same approach: overlap - layout - consensus
CAP3 Assembler
- Removal of poor end regions of reads
- Computation of overlaps between reads
- Removal of false overlaps
- Construction of contigs
- Construction of multiple sequence alignments and generation of consensus sequences
CAP3: Clipping of Low-Quality Regions
Use base quality values (from Phred) and sequence similarities to compute the 5' and 3' clipping positions of reads.
Definition of good regions of a read:
- any sufficiently long region of high-quality values that is similar to a region of another read, OR
- any sufficiently long region that is highly similar to a good high-quality region of another read
Figure: computation of the 5' and 3' clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read.
Huang, Madan, Genome Res 9, 868 (1999)
Celera: compartmentalized shotgun assembler
Uses preliminary data from both human genome assembly projects.
Huson et al. Bioinformatics 17, S132 (2001)
Arachne
Program by Serafim Batzoglou (MIT, PhD thesis 2000)
(i) create a graph G of overlaps between pairs of reads of the shotgun data
(ii) process G in order to construct supercontigs of mapped reads
Batzoglou et al. Genome Res 12, 177 (2002)
Earmuff links
An important variation of whole-genome shotgun sequencing obtains reads from both ends of an insert, one forward and one backward. Since inserts are size-selected, the approximate distance between the two reads obtained from the ends of a fragment is known. Such read pairs will be called earmuff links.
Arachne: creation of overlap graph
List of reads R = (r_1, ..., r_N), where N is the number of reads. Each read r_i has length l_i <
If two reads are taken from the endpoints of the same clone (earmuff link), r_i has a link to another read r_j at a specified distance d_ij.
First: create the graph G of overlaps (edges) between pairs of reads (nodes). Pairs of reads in R need to be aligned. Since R can be very long, N^2 alignments are infeasible.
Create a table of occurrences of k-mers (strings of length k) in the reads and count the number of k-mer matches for each pair of reads. Then perform pairwise alignments only between pairs of reads that share more than a cutoff number of k-mers.
Batzoglou PhD thesis (2002)
Arachne: table of k-mer occurrences
Find the number of k-mer matches in the forward or reverse complement direction between each pair of reads in R.
(1) Obtain all triplets (r, t, v), where r = read in R, t = index of a k-mer occurring in r, v = direction of occurrence (forward or reverse complement).
(2) Sort the set of triplets according to the k-mer indices t.
(3) Use the sorted list to create a table T of quadruplets (r_i, r_j, f, v), where r_i and r_j are reads that contain at least one common k-mer, v is a direction, and f is the number of k-mers in common between r_i and r_j in direction v.
Batzoglou PhD thesis (2002)
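The three steps can be sketched in a few lines. This is an illustrative simplification, not Arachne's implementation. It assumes that a k-mer and its reverse complement are stored under one canonical key (the lexicographically smaller of the two), and that the direction v of a pair is "forward" when both reads contain the k-mer in the same orientation. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_triplets(reads, k):
    """Step (1): triplets (r, t, v) with read index r, canonical k-mer t,
    and orientation v of the canonical k-mer inside the read."""
    triplets = set()
    for r, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            rc = reverse_complement(kmer)
            if kmer <= rc:
                triplets.add((r, kmer, "+"))
            else:
                triplets.add((r, rc, "-"))
    return triplets


def shared_kmer_table(reads, k):
    """Steps (2) and (3): sort by k-mer, then count for every pair of reads
    the number f of common k-mers per direction.
    Returns a Counter mapping (r_i, r_j, direction) -> f with r_i < r_j."""
    ordered = sorted(kmer_triplets(reads, k), key=lambda x: (x[1], x[0], x[2]))
    table = Counter()
    for _, group in groupby(ordered, key=lambda x: x[1]):
        for (ri, _, vi), (rj, _, vj) in combinations(list(group), 2):
            if ri != rj:
                direction = "forward" if vi == vj else "reverse"
                table[(ri, rj, direction)] += 1
    return table


reads = ["ACGGATTC", "GATTCAGA", "AGGTCTGA"]
print(dict(shared_kmer_table(reads, k=4)))
# {(0, 1, 'forward'): 2, (1, 2, 'reverse'): 2}
```

In the toy input, reads 0 and 1 overlap in the same orientation, while read 2 overlaps read 1 only after reverse complementation. Note that the inner loop is quadratic in the number of reads sharing a k-mer, which is one reason to discard very frequent k-mers before this step.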
Arachne: table of k-mer occurrences
Here: k = 3
Batzoglou PhD thesis (2002)
Arachne: table of k-mer occurrences
If a k-mer occurs "too often", it is likely part of a repeat sequence and should not be used for detecting overlaps.
Implementation:
(1) Find the k-mer occurrences (r, t, v) and sort them into 64 files according to the first three nucleotides of each k-mer.
(2) For i = 1, ..., 64: load file i into memory, sort it according to t, and store the sorted file.
(3) Load the 64 sorted files into memory sequentially and create the table T incrementally.
In practice, k = 8 to 24.
Batzoglou PhD thesis (2002)
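Two ingredients of this scheme are easy to sketch: the mapping of a k-mer to one of 64 buckets by its first three nucleotides, and the identification of k-mers that occur too often. The names, the in-memory counting, and the forward-strand-only counting are simplifying assumptions for illustration.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def bucket_index(kmer):
    """Bucket number 0..63 derived from the first three nucleotides."""
    a, b, c = (CODE[x] for x in kmer[:3])
    return 16 * a + 4 * b + c


def frequent_kmers(reads, k, max_count):
    """k-mers occurring more than max_count times over all reads
    (forward strand only in this sketch)."""
    counts = Counter(
        seq[i:i + k] for seq in reads for i in range(len(seq) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n > max_count}


print(bucket_index("ACGTAC"))                              # 6
print(frequent_kmers(["ACGTACGT", "ACGTTT"], 4, 2))        # {'ACGT'}
```

Splitting by prefix keeps each bucket small enough to sort in memory, and k-mers flagged by `frequent_kmers` would simply be skipped when building T.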
Arachne: pairwise read alignments
Perform pairwise alignments between reads that contain more than a cutoff number of common k-mers. When k-mers that are too common (more frequent than a second cutoff) are excluded, it is guaranteed that only O(N) pairwise alignments will be performed.
Only a small number of base substitutions and indels is allowed in the overlapping region of two aligned reads. Use a dynamic programming alignment that disallows deviations of more than a few characters.
Output of the alignment algorithm: for reads r_i, r_j, quadruplets (b_1, b_2, e_1, e_2) of the beginning positions b_1, b_2 and end positions e_1, e_2 of the detected overlap region.
If a significant overlap region is detected, (r_i, r_j, b_1, b_2, e_1, e_2) becomes a link in the overlap graph G.
Batzoglou PhD thesis (2002)
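A dynamic programming alignment that "disallows deviations of more than a few characters" can be realized as a banded alignment. The sketch below computes a banded edit distance between two strings that are assumed to be the candidate overlap regions of two reads. It is a generic textbook technique under unit costs, not Arachne's actual alignment routine, and the function name is hypothetical.

```python
def banded_edit_distance(a, b, band):
    """Cost of the best alignment of a and b (unit-cost substitutions and
    indels) that never strays more than `band` cells from the main diagonal.
    Returns None if the length difference alone exceeds the band."""
    INF = float("inf")
    if abs(len(a) - len(b)) > band:
        return None
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i                      # i deletions
                continue
            best = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])  # (mis)match
            best = min(best,
                       prev.get(j, INF) + 1,    # gap in b
                       cur.get(j - 1, INF) + 1) # gap in a
            cur[j] = best
        prev = cur
    return prev[len(b)]


print(banded_edit_distance("ACGTACGT", "ACGAACGT", band=2))  # 1
print(banded_edit_distance("ACGTACGT", "ACGACGT", band=2))   # 1
```

Only O(band × length) cells are filled instead of the full quadratic table, which is what makes a large number of pairwise alignments affordable. An overlap would be accepted only if the returned cost is small relative to the overlap length.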
Correcting errors in reads
Shown is a portion of a multiple alignment between 5 reads. A base T of quality 30 is aligned to bases C, some of which are of quality greater than 30. The base T is subsequently changed to a base C of quality 30.
Batzoglou et al. Genome Res 12, 177 (2002)
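The correction of a single alignment column can be sketched as a quality-weighted vote. The voting rule used here (sum of quality values per base, corrected bases keep their own quality value) is an assumption chosen to reproduce the example above. The rule actually used by Arachne is more involved.

```python
from collections import defaultdict


def correct_column(column):
    """column: list of (base, quality) pairs from one alignment column.
    Every base is replaced by the base with the highest summed quality;
    each position keeps its original quality value."""
    support = defaultdict(int)
    for base, quality in column:
        support[base] += quality
    consensus = max(support, key=support.get)
    return [(consensus, quality) for _, quality in column]


column = [("C", 40), ("C", 35), ("T", 30), ("C", 20), ("C", 45)]
print(correct_column(column))
# [('C', 40), ('C', 35), ('C', 30), ('C', 20), ('C', 45)]
```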
Partial alignments
3 partial alignments of length k=6 between a pair of reads coalesce to yield a single full alignment of length k=19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads.
Batzoglou et al. Genome Res 12, 177 (2002)
Ambiguity created by the presence of repeats
In the absence of sequencing errors and repeats it would be simple to retrieve all retrievable pairwise distances of reads and to construct G.
In the presence of repeats, a link between two reads in G does not necessarily imply a true overlap. A "repeat link" is a link in G between two reads that come from different regions of the genome and overlap in a repeated segment.
Batzoglou PhD thesis (2002)
Arachne: processing of overlap graph
Some of the repetition in the genome is efficiently masked before the creation of G by discarding k-mers of high frequency when building T.
In addition, heuristic algorithms are used to detect and delete repeat links (not discussed here).
Batzoglou PhD thesis (2002)
Merging contigs
Sequence contigs are formed by merging together pairs of reads that can be merged without ambiguity.
In practice the situation is much worse than shown here: repeats are not 100% conserved between copies.
Batzoglou PhD thesis (2002)
Sequence contigs
Batzoglou PhD thesis (2002)
Using paired pairs of overlaps to merge reads
Arachne searches for instances of two plasmids of similar insert size with sequence overlaps occurring at both ends (paired pairs).
(A) A paired pair of overlaps. The top two reads are end sequences from one insert, and the bottom two reads are end sequences from another. The two overlaps must not imply too large a discrepancy between the insert lengths.
(B) Initially, the top two pairs of reads are merged. Then the third pair of reads is merged in, based on having an overlap with one of the top two left reads, an overlap with one of the top two right reads, and consistent insert lengths. The bottom pair is similarly merged.
Bottom: collections of paired pairs are merged into contigs, and consensus sequences are formed.
Batzoglou et al. Genome Res 12, 177 (2002)
Detection of repeat contigs
Some of the identified contigs are repeat contigs, in which nearly identical sequences from distinct regions are collapsed together. They can be detected because
(a) repeat contigs usually have an unusually high depth of coverage, and
(b) they typically have conflicting links to other contigs.
Figure: contig R is linked to contigs A and B to the right. The distances estimated between R and A and between R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B, then R is probably a repeat linking to two unique regions to the right.
After marking repeat contigs, the remaining contigs should represent the correctly assembled sequence.
Batzoglou et al. Genome Res 12, 177 (2002)
Supercontig creation and gap filling
Unmarked contigs = unique contigs. Iteratively merge contigs into supercontigs.
(A) A supercontig is constructed by successively linking pairs of contigs that share at least two forward-reverse links. Here, 3 contigs are joined into one supercontig.
The layout now consists of a number of supercontigs with interleaved gaps. Most gaps belong to regions marked as repeat contigs; some correspond to regions with insufficient shotgun reads.
(B) Arachne attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs.
Batzoglou et al. Genome Res 12, 177 (2002)
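The grouping rule in (A) can be sketched with a union-find structure: contigs connected by at least two forward-reverse links end up in the same supercontig. This is a deliberately reduced sketch with hypothetical names. It ignores the order, orientation and distance consistency of the contigs, all of which a real assembler must check.

```python
from collections import Counter


def build_supercontigs(contigs, links, min_links=2):
    """contigs: list of contig names; links: list of (contig, contig) pairs,
    one entry per forward-reverse link. Returns sorted groups of contigs."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    counts = Counter(tuple(sorted(link)) for link in links)
    for (a, b), n in counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)       # merge the two groups

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"), ("c3", "c4")]
print(build_supercontigs(contigs, links))
# [['c1', 'c2', 'c3'], ['c4']]
```

Contig c4 stays separate because it is supported by a single link only.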
Contig assembly
If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that
$\mathrm{shift}(b,c) = \mathrm{shift}(a,c) - \mathrm{shift}(a,b)$.
A repeat boundary is detected toward the right of read a if there is no overlap (b,c), nor any path of reads x_1, ..., x_k such that (b,x_1), (x_1,x_2), ..., (x_k,c) are all overlaps and
$\mathrm{shift}(b,x_1) + \dots + \mathrm{shift}(x_k,c) \approx \mathrm{shift}(a,c) - \mathrm{shift}(a,b)$.
Batzoglou et al. Genome Res 12, 177 (2002)
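The repeat-boundary test can be sketched as a depth-limited path search over the overlaps. The sketch assumes that overlaps are stored as a dictionary from ordered read pairs to rightward shifts, that "approximately equal" means agreement within a fixed tolerance, and that a direct overlap (b,c) is also required to have a consistent shift. Function names and the default values of `tolerance` and `max_steps` are hypothetical.

```python
def consistent_path_exists(shifts, b, c, target, tolerance=3, max_steps=4):
    """Depth-limited search for reads b -> x1 -> ... -> c whose shifts
    sum to `target` within `tolerance`."""
    stack = [(b, 0, 0)]                      # (current read, shift sum, steps)
    while stack:
        read, total, steps = stack.pop()
        if read == c and steps > 0:
            if abs(total - target) <= tolerance:
                return True
            continue
        if steps == max_steps:
            continue
        for (x, y), s in shifts.items():
            if x == read:
                stack.append((y, total + s, steps + 1))
    return False


def repeat_boundary_to_right(shifts, a, b, c, **kwargs):
    """True if reads b and c, which both overlap a, cannot be connected
    consistently with shift(a,c) - shift(a,b)."""
    target = shifts[(a, c)] - shifts[(a, b)]
    return not consistent_path_exists(shifts, b, c, target, **kwargs)


shifts = {("a", "b"): 100, ("a", "c"): 300, ("b", "x"): 120, ("x", "c"): 81}
print(repeat_boundary_to_right(shifts, "a", "b", "c"))   # False
```

In the example the path b, x, c has a total shift of 201, close to the expected 300 - 100 = 200, so no repeat boundary is reported to the right of a.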
Consistency of forward-reverse links
(A) The distance d(A,B) (length of gap or negated length of overlap) between two linked contigs A and B can be estimated using the forward-reverse linked reads between them.
(B) The distance d(B,C) between two contigs B, C that are linked to the same contig A can be estimated from their respective distances to the linked contig.
Batzoglou et al. Genome Res 12, 177 (2002)
Types of misassemblies
(A) 3 types of simple minor misassemblies are shown: insertions, deletions, and hanging ends. In all cases, a contiguous segment (of a contig or the genome) of less than 10 kb does not align in the expected location (with the genome or contig).
(B) More misassemblies. First, two pieces of a contig align to distant parts of the genome. Second, adjacent contigs in a supercontig are aligned to distant parts of the genome.
Batzoglou et al. Genome Res 12, 177 (2002)
Filling gaps in supercontigs
(A) Contigs A and B are connected by a path p of contigs X_1, ..., X_k. The distance d_p(A,B) between A and B (along the path p) is the length of the sequence in the path that does not overlap A and B.
(B) Contigs Y_1 and Y_2 share forward-reverse links with the supercontig S. These links position them in the vicinity of the gap between A and B. Therefore, Y_1 and Y_2 will be used as possible stepping points in the path closing the gap from A to B.
Batzoglou et al. Genome Res 12, 177 (2002)
Detection of chimeric reads
Reads l_1, l_2, l_3, r_1, r_2, and r_3, and the absence of a read n (having long overlaps on both sides of a point x) suggest that read c may be chimeric, consisting of the juxtaposition of two disparate genomic segments: one corresponding to the part of c before x, and one corresponding to the part of c after x. Note that reads l_3 and r_3 extend slightly beyond x, as often happens for real chimeric reads.
Batzoglou et al. Genome Res 12, 177 (2002)
Contig Coverage and Read Usage. Batzoglou et al. Genome Res 12, 177 (2002)
Characterization of Contigs. Batzoglou et al. Genome Res 12, 177 (2002)
Characterization of Supercontigs. Batzoglou et al. Genome Res 12, 177 (2002)
Base Pair Accuracy. Batzoglou et al. Genome Res 12, 177 (2002)
Misassemblies. Batzoglou et al. Genome Res 12, 177 (2002)
Computational Performance. Batzoglou et al. Genome Res 12, 177 (2002)
Contig Coverage and Read Usage. Batzoglou et al. Genome Res 12, 177 (2002)
Comparison of different assemblers
Pevzner, Tang, Waterman PNAS 98, 9748 (2001)
You should look out for:
- smallest number of contigs + misassembled contigs
- highest possible coverage by contigs
- lowest possible coverage by misassembled contigs
There is no error-free assembler to date
Comparative analysis of the EULER, PHRAP, CAP, and TIGR assemblers (NM sequencing project). Every box corresponds to a contig in the NM assembly produced by these programs, with colored boxes corresponding to assembly errors. Boxes in the IDEAL assembly correspond to islands in the read coverage. Boxes of the same color show misassembled contigs. Repeats with similarity higher than 95% are indicated by numbered boxes at the solid line showing the genome.
To check the accuracy of the assembled contigs, we fit each assembled contig into the genomic sequence. Inability to fit a contig into the genomic sequence indicates that the contig is misassembled. For example, PHRAP misassembles 17 contigs in the NM sequencing project, each contig containing from two to four fragments from different parts of the genome.
"Biologists 'pay' for these errors at the time-consuming finishing step."
Pevzner, Tang, Waterman PNAS 98, 9748 (2001)
What comes next? Finishing the genome
Usually, the assembly of shotgun data ends with a number of contigs and some remaining gaps. Also, within each contig there are some regions of high error rate.
The goal of the finishing phase is then to obtain a single continuous contig with a low error rate. "Finishers" apply ad hoc rules to decide where additional data is necessary. This data may then be generated in experiments using different chemistry or higher coverage.
Autofinish (phrap group) is a program that helps humans decide which new reads to obtain.
Human experts are only rarely needed...
D. Gordon, C. Desmarais, P. Green, Genome Res, 11, 614 (2001)