DNA Assembly Sanger Reads Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP
Why assemble a genome?
- Current DNA sequencing methods generate reads of 500-700 bp, a limit set by the resolution of electrophoresis
- Whole genomes or large clones therefore need to be fragmented and cloned into a library
- The short fragments are sequenced at random (shotgun approach), and the reads are assembled to form the final consensus sequence
Shotgun Sequencing I: random phase
[Figure: BAC clone (100-200 kb), sheared DNA (1.0-2.0 kb), sequencing templates, random reads. Modified from BCM-HGSC.]
Shotgun Sequencing II: assembly
[Figure: assembled reads and consensus, with the problem regions labelled: low base quality, single-stranded region, mis-assembly (inverted), sequence gap. Modified from BCM-HGSC.]
Shotgun Sequencing III: finishing
[Figure series: the problem regions (low base quality, single-stranded region, inverted mis-assembly, sequence gap) are resolved one at a time until only the consensus remains. Modified from BCM-HGSC.]
High-accuracy sequence: < 1 error per 10,000 bases
How can we deal with the enormous number of reads generated by high-throughput DNA sequencers?
Sanger Institute
Sanger Institute, Hinxton, UK (photographs):
- ABI3700 DNA sequencers
- ABI3730 DNA sequencers
- Colony-picking robots
- Plasmid miniprep robots
- Plasmid miniprep room
- Thermocycler room
Exponential growth of sequence generation
Genetic Sequence Data Bank, October 15 2012, NCBI-GenBank Flat File Release 192.0, Distribution Release Notes: 157,889,737 loci, 145,430,961,262 bases, from 157,889,737 reported sequences
Phred/Phrap/Consed Package
Phred/Phrap/Consed is a widely distributed package for:
a. Reading trace files (chromatograms);
b. Assigning a quality (confidence) value to each individual base;
c. Identifying and masking vector and repeat sequences;
d. Assembling sequences and assigning an error probability to the consensus sequence;
e. Viewing and editing assemblies;
f. Automatic finishing.
Phred/Phrap/Consed Pipeline
Directories:
- chromat_dir (trace files)
- phd_dir (phd files written by Phred)
- edit_dir (FASTA, quality and assembly files)
Phred
References: Genome Research 8: 175-185, 1998; Genome Research 8: 186-194, 1998
Phred
Phred is a program that performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases: a "Phred value" based on an error rate estimated for each individual base.
d. Creates output files: base calls and quality values are written to output files.
Trace File examples

High quality read:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Good quality read:
- no ambiguities (Ns)
- some noise (notice the baseline)
- peaks very well spaced

Poor quality read:
- some ambiguities (Ns)
- bad noise (notice the baseline)
- overlapping peaks
- can be caused by a bad quality template, a bad matrix or a low signal-to-noise ratio

Poor quality read:
- many ambiguities (Ns)
- noise
- caused by a homopolymeric region/polymerase slippage

Sudden drop artifact:
- a good quality region is followed by a sudden drop of signal
- caused by secondary structure

High quality region:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Medium quality region:
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved

Poor quality region (diffusion effects and a decrease in the relative mass difference between the sequencing products):
- overlapping peaks, peaks not evenly spaced
- low resolution
- low confidence in base assignment
Phred Analysis steps
a) Predicts idealized (expected) peaks (amplitudes), based effectively on the best region of the trace
b) Identifies observed peaks
c) Compares observed and expected peaks (divides the peaks into matched and unmatched)
d) Examines the unmatched peaks for any peak that could be called but was not called in step c
Modified from Evan Eichler, Ph.D.
Phred value formula
q = -10 × log10(p)
where:
q = quality value
p = estimated error probability for a base call
Examples:
q = 20 means p = 10^-2 (1 error in 100 bases)
q = 40 means p = 10^-4 (1 error in 10,000 bases)
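The formula above can be checked numerically. The following is a minimal Python sketch, not part of Phred itself; the function names are chosen here for illustration only.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability p into a quality value q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Invert the formula: p = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)


# Print the error probability that corresponds to a few quality values.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"q = {q}: p = {p:g} (about 1 error in {round(1 / p):,} bases)")
```

The scale is logarithmic: every 10 additional quality units correspond to a tenfold lower estimated error probability.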
The structure of a phd file
(Each base call is a record of base, quality value and trace position. The actual file has one record per line; the slide shows excerpts from four regions of the read, grouped here one region per line.)

BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5 c 13 17 a 19 26 c 19 32
t 24 2221 a 24 2232 a 22 2245 a 27 2261 g 25 2272 c 19 2286 c 12 2302 t 19 2314 g 12 2324 g 15 2331 g 19 2346 g 23 2363 t 33 2378 g 36 2390 c 44 2404 c 44 2419 t 39 2433 a 39 2446 a 34 2460 t 35 2470 g 34 2482
t 16 8191 g 19 8200 t 13 8211 c 13 8229 g 4 8241 n 4 8253 c 4 8263 t 10 8276 t 9 8286 c 12 8301 t 16 8313 c 12 8329 c 12 8336 c 15 8343 t 19 8356 c 9 8371 g 13 8386 g 14 8397 a 7 8417 g 9 8427 g 4 8445
t 6 11908 a 6 11921 g 6 11927 t 6 11947 c 6 11953 a 6 11964 g 6 11981 c 4 11994 n 4 12015 c 4 12037 n 4 12044 n 4 12058 n 4 12071 n 4 12085 n 4 12098 n 4 12111 n 4 12124 c 4 12144 n 4 12151
END_DNA
END_SEQUENCE
Excerpt from a high quality region of a phd file (quality values 41-68):
c 57 1778 t 57 1792 g 57 1805 a 57 1820 t 57 1828 g 57 1841 t 57 1853 g 57 1867 a 68 1880 c 68 1889 a 68 1902 g 68 1915 c 68 1927 t 68 1941 c 68 1954 t 68 1967 c 68 1979 a 68 1991 c 68 2000 t 68 2014 c 57 2028 t 57 2040 a 57 2053 g 57 2063 a 41 2079 g 57 2087 g 57 2100 c 57 2112 t 59 2125 g 54 2138 t 57 2149 t 57 2162 g 57 2176 c 57 2186 a 57 2199 g 57 2212 a 57 2228 g 57 2237 g 57 2250 t 57 2263 c 57 2274 c 57 2287 g 57 2302 c 57 2311 g 57 2326 a 57 2341 t 57 2350 t 57 2364 c 68 2375 c 68 2388 t 68 2400 t 68 2414 g 68 2427 c 68 2439 a 68 2451 g 68 2462 c 68 2474 t 68 2488 g 68 2501 c 68 2511 a 68 2523 t 68 2535 a 68 2548 c 68 2559 t 68 2572 a 68 2584 c 68 2596 a 68 2609

Excerpt from a region of mixed quality (quality values 9-47):
t 28 6526 c 31 6539 g 32 6552 t 35 6562 a 35 6574 t 39 6585 g 47 6597 c 43 6608 c 41 6621 c 32 6632 c 31 6645 a 37 6655 c 21 6664 c 18 6678 a 9 6688 g 9 6708 g 9 6712 g 9 6721 a 18 6734 g 37 6745 a 36 6758 t 37 6767 t 37 6779 c 32 6792 g 22 6804 g 20 6816 a 23 6829 c 23 6837 c 24 6852 g 22 6863 g 22 6875 a 25 6889 c 25 6897 a 24 6908 g 31 6919 t 34 6932 a 37 6941 a 37 6952 t 41 6964 c 39 6976 g 39 6988 a 28 6997 a 21 7008 t 15 7017 t 15 7027 c 12 7034 c 13 7049 c 14 7062 g 32 7078 c 20 7090 g 18 7101 g 10 7112 c 9 7121 c 9 7137 g 9 7149 c 9 7156 c 9 7171 a 18 7182 t 25 7192 g 37 7204 g 39 7214 c 36 7228 g 36 7238 g 31 7249 c 22 7262 c 22 7276 g 22 7288 g 20 7296 g 20 7311 a 19 7324 g 21 7333 c 15 7344 a 16 7353 t 15 7366

Continuation of the same excerpt, where quality declines toward the end (quality values 6-37):
g 25 7377 c 22 7389 g 26 7402 a 16 7414 c 24 7423 g 15 7437 t 28 7450 c 19 7459 g 19 7475 g 19 7484 g 16 7491 c 19 7506 c 19 7520 c 32 7530 a 34 7540 a 37 7552 t 31 7562 t 26 7575 c 27 7586 g 27 7599 c 23 7607 c 26 7620 c 26 7631 t 30 7642 a 30 7653 t 15 7663 a 12 7674 g 11 7687 t 12 7698 g 12 7708 a 26 7720 g 21 7730 t 34 7743 c 34 7755 g 37 7766 t 37 7777 a 32 7787 t 16 7797 t 10 7809 a 8 7817 c 8 7828 a 8 7847 t 22 7860 t 19 7872 c 30 7881 a 37 7889 c 37 7900 t 25 7912 g 24 7923 g 22 7935 c 13 7942 c 13 7953 g 10 7963 c 12 7979 g 8 7988 t 8 8002 t 8 8019 t 12 8023 t 8 8034 t 6 8050 t 6 8061 a 6 8066 c 8 8086 a 6 8092 t 6 8107 a 7 8117 a 8 8126 a 8 8131 g 8 8145 g 8 8153
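To make the phd layout concrete, here is a minimal Python sketch of a reader for it. It assumes only the layout shown in the slides (BEGIN_SEQUENCE, a BEGIN_COMMENT/END_COMMENT block of "KEY: value" lines, and a BEGIN_DNA/END_DNA block with one "base quality position" record per line). The names `BaseCall` and `parse_phd` are hypothetical and do not belong to the Phred package.

```python
from typing import NamedTuple


class BaseCall(NamedTuple):
    base: str       # called base (a, c, g, t or n)
    quality: int    # Phred quality value
    position: int   # peak position in the trace array


def parse_phd(text: str):
    """Return (read name, comment fields, base calls) from the text of one phd file."""
    name = ""
    comments = {}
    calls = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(maxsplit=1)
            name = parts[1] if len(parts) > 1 else ""
        elif line == "BEGIN_COMMENT":
            section = "comment"
        elif line == "BEGIN_DNA":
            section = "dna"
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif section == "comment":
            # Split at the first colon only, so time stamps keep their colons.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "dna":
            base, quality, position = line.split()
            calls.append(BaseCall(base, int(quality), int(position)))
    return name, comments, calls


# A shortened example built from records shown in the slide.
EXAMPLE = """\
BEGIN_SEQUENCE 01EBV10201A02.g

BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
CALL_METHOD: phred
TIME: Thu May 24 00:18:58 2001
END_COMMENT

BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
t 24 2221
g 36 2390
c 44 2404
n 4 12151
END_DNA

END_SEQUENCE
"""

name, comments, calls = parse_phd(EXAMPLE)
high_quality = [c for c in calls if c.quality >= 20]
print(name, len(calls), "base calls,", len(high_quality), "with q >= 20")
```

The q >= 20 cut-off in the example is an arbitrary choice for illustration (1 estimated error in 100 bases).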
Conversion of phd files into FASTA files
phd2fasta script
Features:
- Phred creates single-sequence files (phd files) containing the sequence itself plus the quality assignments
- The input for the cross_match and phrap programs is a multiple-sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two multiple-sequence FASTA-format files, one containing the sequence information and the other the base-call quality information
- The phredPhrap script automatically executes phd2fasta before running cross_match and phrap!
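The idea of the conversion can be illustrated with a short Python sketch. This is not the phd2fasta script (which is written in Perl), and the header lines are simplified to the read name only. The function name, the `width` parameter and the second read name are assumptions made for the example.

```python
def reads_to_fasta_and_qual(reads, width=50):
    """Build the text of a multi-sequence FASTA file and its matching quality file.

    `reads` maps a read name to a list of (base, quality) pairs.
    Both outputs use the same headers and the same number of entries per line.
    """
    fasta_lines = []
    qual_lines = []
    for name, calls in reads.items():
        bases = "".join(base for base, _ in calls)
        quals = [str(quality) for _, quality in calls]
        fasta_lines.append(f">{name}")
        qual_lines.append(f">{name}")
        for i in range(0, len(calls), width):
            fasta_lines.append(bases[i:i + width])
            qual_lines.append(" ".join(quals[i:i + width]))
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"


# Two toy reads: the first uses records from the phd example, the second is made up.
reads = {
    "01EBV10201A02.g": [("t", 8), ("c", 13), ("a", 19), ("c", 19)],
    "hypothetical_read_2": [("g", 36), ("c", 44)],
}
fasta_text, qual_text = reads_to_fasta_and_qual(reads)
print(fasta_text)
print(qual_text)
```

Keeping the two files parallel (same read order, same headers) is what lets a downstream program pair each base with its quality value.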
Vector screening
Features: this step removes or screens out vector sequence before running phrap.
Program: cross_match, a program for rapid sequence comparison and database search based on an efficient implementation of the Smith-Waterman-Gotoh algorithm.
Command: cross_match seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing sequences in FASTA format
- all sequences in seq_file1 (query) are compared to the sequences in seq_file2 (subject)
- matches meeting the relevant criteria are written to the standard output
Vector screening
Example:
cross_match seqfile.fasta vector.seq -minmatch 10 -minscore 20 -screen > screen.out
where:
- 'seqfile.fasta' is a file containing multiple reads in FASTA format
- 'vector.seq' is a file containing the vector sequences
- '-minmatch' and '-minscore' are parameters for the pairwise alignment
- '-screen' creates a file named seqfile.fasta.screen containing vector-masked versions of the original sequences. Any region matching any part of a vector sequence is replaced by Xs.
- 'screen.out' contains a list of the matches found
- the '.screen' file is the input for phrap
- if a '.qual' file was created (i.e. seqfile.fasta.qual), it has to be renamed to seqfile.fasta.screen.qual. The phredPhrap script automatically performs this step!
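The masking step itself is simple once the matches are known. The Python sketch below illustrates only that step: it does not align anything (cross_match finds the matches), and the function name and the 1-based inclusive coordinate convention are assumptions made for the example.

```python
def mask_intervals(sequence, matches, mask_char="X"):
    """Replace every position covered by a match with mask_char.

    `matches` is a list of (start, end) pairs, 1-based and inclusive,
    standing in for vector matches reported by an alignment program.
    """
    masked = list(sequence)
    for start, end in matches:
        # Clamp to the sequence so out-of-range coordinates are ignored.
        for i in range(max(start, 1) - 1, min(end, len(sequence))):
            masked[i] = mask_char
    return "".join(masked)


# Toy read in which the first 12 bases are treated as a vector match.
read = "GAATTCGTAATCATGGTCATAGCTGTTACGTACGTTAGC"
print(mask_intervals(read, [(1, 12)]))
```

Masking with X rather than deleting keeps the read length and base positions unchanged, so the quality values in the '.qual' file still line up with the screened sequence.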
Phrap
Phragment Assembly Program or... Phil's Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data.
Command: phrap seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing multiple sequences in FASTA format
- the current version only handles a single sequence file
- all the sequences in the seq_file are compared to each other
Phrap
Key Features:
a. Uses the entire read content: no need for trimming.
b. Combines user-supplied (e.g. Repbase) and internally computed data: better assembly accuracy in the presence of repeats.
c. The contig sequence is a mosaic of the highest quality parts of the reads: it's not a consensus!
e. Handles very large datasets: hundreds of thousands of reads are easily handled.
f. Generates output files: these contain important data and enable visualization by other programs.
Phrap output files
- *.contigs: FASTA file containing the contigs
  - contigs with more than one read
  - singletons (single reads with a match to some other contig, but which could not be merged consistently with it)
- *.singlets: FASTA file of the singlet reads
  - reads with no match to any other read
- *.ace: allows the assembly to be viewed with Consed
Consed
Reference: Genome Research 8: 195-202, 1998

Consed is a program for viewing and editing assemblies produced by Phrap.
Key Features:
a. Assembly viewer: allows visualization of contigs, the assembly (aligned reads), the quality values of the reads and the final sequence.
b. Trace file viewer: single and multiple trace files can be displayed, allowing a given sequence to be compared across several reads.
c. Navigation: identifies and lists regions that are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
d. Autofinish: an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
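The kind of list produced by the navigation feature (regions below a given quality threshold) can be sketched in a few lines of Python. This is an illustration of the idea only, not Consed's implementation; the function name, the default threshold of 20 and the `min_length` parameter are assumptions.

```python
def low_quality_regions(qualities, threshold=20, min_length=1):
    """Return (start, end) pairs, 1-based inclusive, of runs with quality below threshold."""
    regions = []
    start = None
    for i, q in enumerate(qualities, start=1):
        if q < threshold:
            if start is None:
                start = i          # a low-quality run begins here
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(qualities) - start + 1 >= min_length:
        regions.append((start, len(qualities)))
    return regions


consensus_quality = [40, 40, 15, 12, 40, 40, 8, 40, 9, 9, 9]
print(low_quality_regions(consensus_quality))
print(low_quality_regions(consensus_quality, min_length=2))
```

Each reported region is a candidate for inspection in the trace viewer or for an additional finishing read.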
Autofinish
Reference: Genome Research 11: 614-625, 2001

Features:
- Autofinish is part of the Consed package.
- It automatically chooses the finishing reads needed to finish a project.
- The "finished" status is defined by the user through pre-defined parameters.

Autofinish allows the user to:
- Figure out how contigs are ordered and oriented
- Close gaps
- Improve the error rate
- Cover every base by reads from at least 2 different subclones
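The last criterion (every base covered by reads from at least 2 different subclones) can be illustrated with a small Python sketch. This is not how Autofinish is implemented; the function name, the placement tuples and the coordinate convention (1-based, inclusive) are assumptions made for the example.

```python
def undercovered_regions(contig_length, placements, min_subclones=2):
    """Return (start, end) regions covered by fewer than min_subclones distinct subclones.

    `placements` is a list of (subclone name, start, end) tuples giving where
    each read aligns on the contig, 1-based and inclusive.
    """
    subclones_at = [set() for _ in range(contig_length)]
    for subclone, start, end in placements:
        for i in range(max(start, 1) - 1, min(end, contig_length)):
            subclones_at[i].add(subclone)

    regions = []
    start = None
    for i, names in enumerate(subclones_at, start=1):
        if len(names) < min_subclones:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, contig_length))
    return regions


# Two reads from subclone A and one from subclone B on a 10-base toy contig.
placements = [("A", 1, 6), ("A", 4, 8), ("B", 5, 10)]
print(undercovered_regions(10, placements))
```

Note that two reads from the same subclone count only once, which is why the sketch stores subclone names in a set rather than counting reads.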
Autofinish
Autofinish will suggest any of the following types of reads:
- Forward universal primer terminator reads
- Reverse universal primer terminator reads
- Custom primer reads with subclone template
- Custom primer reads with whole clone template
- Minilibraries
- PCR

Finishing procedure:
[Flow diagram: shotgun reads, then "assemble new reads with existing reads", then "Autofinish suggests reads", then "make reads in lab", then back to "assemble new reads with existing reads".]
How to get the programs
Supported platforms:
- Solaris
- Linux computers (i686, i386, EM64T, AMD64)
- Mac (OS X)
Note: there are commercial versions of Phred/Phrap for the DOS/Windows platform (no Consed version so far)
Internet site (academic version): http://www.phrap.org/phredphrapconsed.html
Contacts
To obtain the programs, or for questions, bug reports and suggestions:
- Phrap/Cross_match/Swat: Phil Green, phg@u.washington.edu
- Phred: Brent Ewing, bge@u.washington.edu
- Consed: David Gordon, gordon@genome.washington.edu
The Staden Package
Medical Research Council, Laboratory of Molecular Biology (MRC-LMB), UK (no longer supported by the original team; now open source)

pregap4: preparing sequence trace data for analysis or assembly
- Graphical user interface
- Prepare trace data
- Automation
- Trace format conversion
- Quality analysis
- Vector clipping
- Contaminant screening
- Repeat searching

Gap4 and Gap5: assembly programs
- Assembly
- Contig joining
- Assembly checking
- Repeat searching
- Experiment suggestion
- Read pair analysis
- Contig editing
- Graphical views of contigs
- Database
Note: ace files produced by a special version of Phrap can be viewed by Gap4
The Staden Package
Supported platforms:
- Sun Solaris
- Compaq Tru64 UNIX (Alpha)
- SGI Irix
- Linux
- MS Windows (Win9x, NT, 2000)
Availability:
- http://staden.sourceforge.net/
CAP3 - Sequence Assembly Program
Reference: Genome Research 9: 868-877, 1999
Characteristics:
- Makes use of quality values: the qual files produced by Phred can be used by CAP3
- Produces an ace file compatible with Consed
- Can also be used in Gap4 (Staden Package)
- Program available at http://seq.cs.iastate.edu/
Finishing Problems
Finishing can be a tedious and difficult task due to:

DNA sequencing problems
a. High GC content: genomes with a high GC content are more prone to generate artifacts such as compressions, sudden drops and bad quality regions. Try Dye Primer instead of Dye Terminator, change the chemistry, add DMSO, increase the annealing temperature, use deaza-dGTP instead of dGTP, etc.
b. Palindromic regions: these lead to strong secondary structures that cause sudden drops. Try deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions: these can reduce DNA synthesis efficiency for some chemistries. Try Dye Primer instead of Dye Terminator, or change the chemistry (dRhodamine instead of BigDye).

DNA assembly problems
a. High repeat content: highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen for it with cross_match or RepeatMasker and mask it. Assemble again and add the repetitive region only at the end. Map the repetitive region with restriction enzymes to estimate its size and the number of repeat units.
b. High AT content: some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. This is very difficult to solve. Try to determine a restriction map and combine the mapping with the DNA sequencing data.