1
DNA Assembly Sanger Reads
Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP
2
Why assemble a genome?
- Current DNA sequencing methods generate reads of limited length, set by the resolution limit of electrophoresis
- Whole genomes or large clones need to be fragmented (clone library)
- Short fragments are randomly sequenced (shotgun approach) and the reads are assembled to form the final consensus sequence
3
Shotgun Sequencing I – random phase
Figure labels: BAC clone (kb), sheared DNA (kb), sequencing templates, random reads. Modified from BCM-HGSC.
4
Shotgun Sequencing II - assembly
Figure labels: low base quality, single-stranded region, mis-assembly (inverted), sequence gap, consensus. Modified from BCM-HGSC.
5
Shotgun Sequencing III - finishing
Figure labels, removed one per slide over slides 5 to 9 as each problem is resolved: low base quality, single-stranded region, mis-assembly (inverted), sequence gap.
Consensus: high accuracy sequence, < 1 error / 10,000 bases. Modified from BCM-HGSC.
10
How to deal with the enormous number of reads generated by high-throughput DNA sequencers?
Sanger Institute
11
Sanger Institute - Hinxton - UK
ABI3700 DNA sequencers
12
Sanger Institute - Hinxton - UK
ABI3730 DNA sequencers
13
Sanger Institute - Hinxton - UK
ABI3700 DNA sequencers
14
Sanger Institute - Hinxton - UK
Colony-picking robots
15
Sanger Institute - Hinxton - UK
Colony-picking robot
16
Sanger Institute - Hinxton - UK
Plasmid miniprep robots
17
Sanger Institute - Hinxton - UK
Plasmid miniprep room
18
Sanger Institute - Hinxton - UK
Thermocycler room
19
Exponential growth of sequence generation
Genetic Sequence Data Bank - October
NCBI-GenBank Flat File Release 192.0
Distribution Release Notes: loci, bases … from reported sequences
24
Phred/Phrap/Consed Package
Phred/Phrap/Consed is a widely distributed package for:
a. Reading trace files (chromatograms);
b. Assigning a quality (confidence) value to each individual base;
c. Identifying and masking vector and repeat sequences;
d. Assembling sequences and assigning an error probability to the consensus sequence;
e. Viewing and editing assemblies;
f. Automatic finishing.
25
Phred/Phrap/Consed Pipeline
Directories: chromat_dir, phd_dir, edit_dir
26
Phred
Genome Research 8: 175-185, 1998
27
Phred
Genome Research 8: 186-194, 1998
28
Phred
Phred is a program that performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
29
Phred
c. Assigns quality values to the bases: a "Phred value" based on an error probability estimated for each individual base.
d. Creates output files: base calls and quality values are written to output files.
30
Trace File
High quality read:
- no ambiguities (Ns)
- no noise
- peaks very well spaced
31
Trace File
Good quality read:
- no ambiguities (Ns)
- some noise (notice the baseline)
- peaks very well spaced
32
Trace File
Poor quality read:
- some ambiguities (Ns)
- bad noise (notice the baseline)
- overlapping peaks
- can be caused by a bad quality template, a bad matrix, or a low signal-to-noise ratio
33
Trace File
Poor quality read:
- many ambiguities (Ns)
- noise
- caused by a homopolymeric region/polymerase slippage
34
Trace File
Sudden drop artifact:
- a good quality region is followed by a sudden drop of signal
- caused by secondary structure
35
Trace File
High quality region:
- no ambiguities (Ns)
- no noise
- peaks very well spaced
36
Trace File
Medium quality region:
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved
37
Trace File
Poor quality region, caused by diffusion effects and a decrease in the relative mass difference between the sequencing products:
- overlapping peaks, peaks not evenly spaced
- low resolution
- low confidence in base assignment
38
Phred
Analysis steps:
a) Predicts idealized (expected) peaks (amplitudes), based effectively on the best region of the trace
b) Identifies observed peaks
c) Compares observed and expected peaks (divides the peaks into matched and unmatched)
d) Unmatched peaks are analyzed for any peak that could be called, but was not called in step c
Modified from Evan Eichler, Ph.D
39
Phred value formula
$q = -10 \log_{10}(p)$
where
q - quality value
p - estimated error probability for a base call
Examples:
q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
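The formula above can be checked numerically. The following is a minimal Python sketch, not part of the Phred package; the function names `error_probability` and `phred_quality` are chosen here for illustration only.

```python
import math

def error_probability(q):
    """Error probability p for a Phred quality value q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)

def phred_quality(p):
    """Phred quality value q for an error probability p: q = -10 * log10(p)."""
    return -10 * math.log10(p)

# Print the error probability for a few common quality values.
for q in (10, 20, 30, 40):
    print(f"q = {q}: p = {error_probability(q):g}")
```

A consensus with fewer than 1 error per 10,000 bases corresponds to a quality value above 40 on this scale.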
40
The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
t a a a g c c t g g g g t g c c t a a t g t g t c g n c t t c t c c c t c g g a g g t a g t c a g c n c n n n n n n n c n
END_DNA
END_SEQUENCE
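In a phd file, each line between BEGIN_DNA and END_DNA holds a base call, its quality value and its peak position in the trace. The Python sketch below reads such a file into simple data structures. It is an illustration only, not code from the Phred package: `parse_phd` is a hypothetical helper, and the sample text reuses only the fields and the four complete base lines that survive on the slide.

```python
def parse_phd(text):
    """Parse the text of one phd file.

    Returns (name, comments, calls), where calls is a list of
    (base, quality, peak_position) tuples.
    """
    name = None
    comments = {}
    calls = []
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split(maxsplit=1)[1]
        elif line in ("BEGIN_COMMENT", "BEGIN_DNA"):
            section = line
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif section == "BEGIN_COMMENT":
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "BEGIN_DNA":
            base, quality, position = line.split()
            calls.append((base, int(quality), int(position)))
    return name, comments, calls

sample = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
CALL_METHOD: phred
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""

name, comments, calls = parse_phd(sample)
print(name, comments["CHEM"], calls)
```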
42
c t g a t g t g a c a g c t c t c a c t c t a g a g g c t g t t g c a g a g g t c c g c g a t t c c t t g c a g c t g c a t a c t a c a
44
t c g t a t g c c c c a c c a g g g a g a t t c g g a c c g g a c a g t a a t c g a a t t c c c g c g g c c g c c a t g g c g g c c g g g a g c a t
46
g c g a c g t c g g g c c c a a t t c g c c c t a t a g t g a g t c g t a t t a c a t t c a c t g g c c g c g t t t t t t a c a t a a a g g
48
Conversion of phd files into FASTA files
phd2fasta script
Features:
- Phred creates single-sequence files containing the sequence itself plus the quality assignments (phd files)
- The input file for the cross_match and phrap programs is a multiple sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two multiple sequence FASTA format files, containing the sequence information and the basecall quality information respectively
- The phredPhrap script automatically executes phd2fasta before running cross_match and phrap!
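The conversion described above can be sketched in a few lines of Python. This is not the phd2fasta Perl script and does not reproduce its exact header format. `to_fasta_and_qual` is a hypothetical helper that assumes the base calls and quality values have already been read from the phd files, and the read names and values are invented for the example.

```python
def to_fasta_and_qual(reads, width=50):
    """Build the two multi-sequence files as text.

    reads maps a read name to a list of (base, quality) pairs.
    Returns (fasta_text, qual_text); both use the same '>' header lines.
    """
    fasta_lines = []
    qual_lines = []
    for name, calls in reads.items():
        bases = "".join(base for base, _ in calls)
        quals = [str(quality) for _, quality in calls]
        fasta_lines.append(f">{name}")
        qual_lines.append(f">{name}")
        # Wrap both files at the same number of bases per line.
        for i in range(0, len(calls), width):
            fasta_lines.append(bases[i:i + width])
            qual_lines.append(" ".join(quals[i:i + width]))
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"

# Invented example reads (names and values are placeholders).
reads = {
    "read1": [("t", 8), ("c", 13), ("a", 19), ("c", 19)],
    "read2": [("g", 30), ("g", 32), ("a", 40)],
}
fasta_text, qual_text = to_fasta_and_qual(reads)
print(fasta_text)
print(qual_text)
```

Keeping the same header and the same order in both files is what lets a downstream program pair each base with its quality value.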
50
Vector screening
Features:
This step removes or screens out vector sequence before running phrap.
Program: cross_match, a program for rapid sequence comparison and database search based on an efficient implementation of the Smith-Waterman-Gotoh algorithm.
Command: cross_match seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing sequences in FASTA format
- all sequences in seq_file1 (query) are compared to the sequences in seq_file2 (subject)
- matches meeting relevant criteria are written to the standard output
51
Vector screening
Example:
cross_match seqfile.fasta vector.seq -minmatch 10 -minscore 20 -screen > screen.out
where:
- 'seqfile.fasta' is a file containing multiple reads in FASTA format
- 'vector.seq' is a file containing the vector sequences
- '-minmatch' and '-minscore' are parameters for the pairwise alignment
- '-screen' creates a file named seqfile.fasta.screen containing vector-masked versions of the original sequences. Any region matching any part of a vector sequence is replaced by Xs.
- 'screen.out' contains a list of the matches found
- the '.screen' file is the input for phrap
- if a '.qual' file was created (i.e. seqfile.fasta.qual), it has to be renamed to seqfile.fasta.screen.qual; the phredPhrap script automatically performs this step!
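The effect of the '-screen' option on a single read can be illustrated with a small Python sketch. It does not run or parse cross_match: `mask_regions` is a hypothetical helper, and the match coordinates are invented for the example (0-based, end exclusive) rather than taken from real cross_match output. The read is the start of the sequence shown on slide 46, used only as sample text.

```python
def mask_regions(sequence, regions):
    """Replace every (start, end) region of sequence with Xs.

    Coordinates are 0-based with the end exclusive; regions may overlap.
    """
    masked = list(sequence)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "X"
    return "".join(masked)

read = "gcgacgtcgggcccaattcgccctatagtgagtcgtattac"
vector_hits = [(0, 12), (30, 41)]  # assumed vector matches, for illustration
masked = mask_regions(read, vector_hits)
print(read)
print(masked)
```

Masking keeps the read at its original length, which is why the quality file can be reused under the new '.screen.qual' name without modification.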
53
Phrap - Phragment Assembly Program or… Phil's Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data
Command: phrap seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing multiple sequences in FASTA format
- the current version only handles a single sequence file
- all the sequences in the seq_file are compared to each other
54
Phrap
Key Features:
a. Uses the entire read content: no need for trimming.
b. Uses user-supplied (e.g. Repbase) and internally computed data: better assembly accuracy in the presence of repeats.
c. The contig sequence is a mosaic of the highest quality parts of the reads: it is not a consensus!
55
Phrap
Key Features:
e. Handles very large datasets: hundreds of thousands of reads are easily handled.
f. Generates output files: these contain important data and enable visualization by other programs.
56
Phrap output files
*.contigs: FASTA file containing the contigs
- Contigs with more than one read
- Singletons (single reads with a match to some other contig that could not be merged consistently with it)
*.singlets: FASTA file of the singlet reads
- Reads with no match to any other read
*.ace: allows viewing the assembly using Consed
58
Consed
Genome Research 8: 195-202, 1998
59
Consed
Consed is a program for viewing and editing assemblies produced by Phrap
Key Features:
a. Assembly viewer: allows visualization of contigs, the assembly (aligned reads), quality values of reads and the final sequence.
b. Trace file viewer: single and multiple trace files can be visualized, allowing comparison of a given sequence in several reads.
60
Consed
Key Features:
c. Navigation: identifies and lists regions which are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
d. Autofinish: automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
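The idea behind the navigation feature, listing regions below a given quality threshold, can be sketched as follows. This is a simplified Python illustration, not Consed's implementation: `low_quality_regions` is a hypothetical helper and the quality values are invented.

```python
def low_quality_regions(qualities, threshold=20):
    """Return the runs of positions whose quality is below threshold.

    Each run is a (start, end) pair, 0-based with the end exclusive.
    """
    regions = []
    start = None
    for i, quality in enumerate(qualities):
        if quality < threshold:
            if start is None:
                start = i  # a new low-quality run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        # The last run extends to the end of the sequence.
        regions.append((start, len(qualities)))
    return regions

consensus_quality = [8, 13, 19, 19, 35, 40, 42, 15, 12, 45, 50, 9]
regions = low_quality_regions(consensus_quality)
print(regions)
```

Each reported region is a candidate for inspection in the trace viewer or for an additional finishing read.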
71
Autofinish
Genome Research 11: 614-625, 2001
72
Autofinish
Features:
- Autofinish is part of the Consed package.
- It automatically chooses finishing reads in order to finish a project.
- The "finished" status is defined by the user according to pre-defined parameters
73
Autofinish
Autofinish allows the user to:
- Figure out how contigs are ordered and oriented
- Close gaps
- Improve the error rate
- Cover every base by reads from at least 2 different subclones
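The last goal, covering every base by reads from at least 2 different subclones, amounts to a coverage check. The Python sketch below only illustrates that criterion and is not Autofinish's algorithm: `undercovered_positions` is a hypothetical helper, and the subclone names and read coordinates are invented (0-based, end exclusive).

```python
def undercovered_positions(contig_length, reads, min_subclones=2):
    """List contig positions covered by fewer than min_subclones subclones.

    reads is a list of (subclone, start, end) tuples.
    """
    subclones_at = [set() for _ in range(contig_length)]
    for subclone, start, end in reads:
        for pos in range(max(start, 0), min(end, contig_length)):
            subclones_at[pos].add(subclone)
    return [pos for pos, names in enumerate(subclones_at)
            if len(names) < min_subclones]

# Two reads from subclone A count only once where they overlap.
reads = [("A", 0, 6), ("A", 4, 10), ("B", 3, 8), ("C", 7, 12)]
weak = undercovered_positions(12, reads)
print(weak)
```

Positions returned by the check are those where a finishing read from another subclone would be needed.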
74
Autofinish
Autofinish will suggest any of the following types of reads:
- Forward universal primer terminator reads
- Reverse universal primer terminator reads
- Custom primer reads with subclone template
- Custom primer reads with whole clone template
- Minilibraries
- PCR
75
Autofinish
Finishing procedure (cycle):
- Shotgun reads
- Assemble new reads with existing reads
- Autofinish suggests reads
- Make reads in lab
- Assemble new reads with existing reads (the cycle repeats)
76
How to get the programs
Supported platforms:
- Solaris
- Linux computers (i686, i386, EM64T, AMD64)
- Mac (OS X)
Note: there are commercial versions of Phred/Phrap for the DOS/Windows platform (no Consed version so far)
Internet site (academic version):
77
Contacts
To obtain the programs, or for questions, bug reports and suggestions:
- Phrap/Cross_match/Swat: Phil Green, phg@u.washington.edu
- Phred: Brent Ewing
- Consed: David Gordon
78
The Staden Package
Medical Research Council, Laboratory of Molecular Biology (MRC-LMB), UK (no longer supported by the original team; now open source)
pregap4: preparing sequence trace data for assembly
- Graphical user interface
- Prepare trace data
- Automation
- Trace format conversion
- Quality analysis
- Vector clipping
- Contaminant screening
- Repeat searching
79
The Staden Package
Assembly programs Gap4 and Gap5:
- Assembly
- Contig joining
- Assembly checking
- Repeat searching
- Experiment suggestion
- Read pair analysis
- Contig editing
- Graphical views of contigs
- Database
Note: ace files produced by a special version of Phrap can be viewed by Gap4
85
The Staden Package
Supported platforms:
- Sun Solaris
- Compaq Tru64 UNIX (Alpha)
- SGI Irix
- Linux
- MS Windows (Win9x, NT, 2000)
Availability:
- http://staden.sourceforge.net/
86
CAP3 - Sequence Assembly Program
Genome Research 9: 868-877, 1999
87
CAP3 - Sequence Assembly Program
Characteristics:
- Makes use of quality values: qual files produced by Phred can be used by CAP3
- Produces an ace file compatible with Consed
- Can also be used in Gap4 (Staden Package)
- Program available at
88
Finishing Problems
Finishing can be a boring and difficult task due to:
DNA sequencing problems
a. High GC content: genomes with a high GC content are more prone to artifacts such as compressions, sudden drops and bad quality regions. Try Dye Primer instead of Dye Terminator, change the chemistry, add DMSO, increase the annealing temperature, use deaza-dGTP instead of dGTP, etc.
b. Palindromic regions: lead to strong secondary structures that cause sudden drops. Try deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions: can reduce DNA synthesis efficiency for some chemistries. Try Dye Primer instead of Dye Terminator, or change the chemistry (dRhodamine instead of BigDye).
89
Finishing Problems
Finishing can be a boring and difficult task due to:
DNA assembly problems
a. High repeat content: highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen for it with cross_match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units.
b. High AT content: some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and combine the mapping with the DNA sequencing data.