DNA Assembly Sanger Reads

Name: DNA Assembly Sanger Reads
Uploaded: 2017-08-26T16:10:33+00:00
Duration: PTM47S18
Channel: Asher Stafford
Description: DNA Assembly Sanger Reads

DNA Assembly Sanger Reads
Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP

Why assemble a genome?
- Current DNA sequencing methods generate reads of only a limited number of bp, set by the resolution limit of electrophoresis
- Whole genomes or large clones need to be fragmented into a clone library
- Short fragments are randomly sequenced (shotgun approach) and the reads are assembled to form the final consensus sequence

Shotgun Sequencing I – random phase
Figure labels: Sheared DNA (kb); BAC clone (kb); Random Reads; Sequencing Templates. Modified from BCM-HGSC.

Shotgun Sequencing II - assembly
Figure labels: Low Base Quality; Single Stranded Region; Mis-Assembly (Inverted); Sequence Gap; Consensus. Modified from BCM-HGSC.

Shotgun Sequencing III - finishing
Figure labels, dropped one per slide as finishing proceeds: Low Base Quality; Single Stranded Region; Mis-Assembly (Inverted); Sequence Gap.
Final slide: Consensus. High Accuracy Sequence: < 1 error / 10,000 bases. Modified from BCM-HGSC.

How to deal with the enormous number of reads generated by high-throughput DNA sequencers?
Sanger Institute

Sanger Institute - Hinxton - UK
ABI3700 DNA sequencers

Colony-picking robots

Colony-picking robot

Plasmid miniprep robots

Plasmid miniprep room

Thermocycler room

Exponential growth of sequence generation
Genetic Sequence Data Bank - October NCBI-GenBank Flat File Release 192.0 Distribution Release Notes: loci, bases … from reported sequences

Phred/Phrap/Consed Package
Phred/Phrap/Consed is a package distributed worldwide for:
a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector and repeat sequence identification and masking;
d. Sequence assembly and error probability assignment to the consensus sequence;
e. Assembly viewing and editing;
f. Automatic finishing.

Phred/Phrap/Consed Pipeline
Directories: chromat_dir, phd_dir, edit_dir

Phred Genome Research 8: 175-185, 1998

Phred Genome Research 8: 186-194, 1998

Phred
Phred is a program that performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases: a "Phred value" based on an error rate estimate calculated for each individual base.
d. Creates output files: base calls and quality values are written to output files.

Trace File
High quality read:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Trace File
Good quality read:
- no ambiguities (Ns)
- some noise (notice baseline)
- peaks very well spaced

Trace File
Poor quality read:
- some ambiguities (Ns)
- bad noise (notice baseline)
- overlapping peaks
- can be caused by a bad quality template, a bad matrix, or a low signal-to-noise ratio

Trace File
Poor quality read:
- many ambiguities (Ns)
- noise
- caused by homopolymeric region/polymerase slippage

Trace File
Sudden drop artifact:
- good quality region is followed by a sudden drop of signal
- caused by secondary structure

Trace File
High quality region:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Trace File
Medium quality region:
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved

Trace File
Poor quality region, caused by diffusion effects and a decrease in the relative mass difference between the sequence products:
- overlapping peaks, peaks not evenly spaced
- low resolution
- low confidence in base assignment

Phred
Analysis steps:
a) Predicts idealized (expected) peaks (amplitudes), based effectively on the best region of the trace
b) Identifies observed peaks
c) Compares observed and expected peaks (divides the peaks into matched and unmatched)
d) Analyzes unmatched peaks for any peak that could be called but was not called in step c
Modified from Evan Eichler, Ph.D

Phred value formula
q = -10 x log10(p)
where:
q - quality value
p - estimated error probability of a base call
Examples:
q = 20 means p = 10^-2 (1 error in 100 bases)
q = 40 means p = 10^-4 (1 error in 10,000 bases)
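The formula above can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and are not part of Phred.

```python
import math


def phred_to_error_prob(q):
    """Return the estimated error probability p for a Phred quality value q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred quality value q for an estimated error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # The two examples from the slide: q = 20 and q = 40.
    for q in (20, 40):
        print(q, phred_to_error_prob(q))
```

The finishing target quoted earlier (fewer than 1 error per 10,000 bases) corresponds to q = 40 under this formula.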

The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
t a a a g c c t g g g g t g c c t a a t g t g t c g n c t t c t c c c t c g g a g g t a g t c a g c n c n n n n n n n c n
END_DNA
END_SEQUENCE
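As an illustration of this layout, here is a small Python sketch that reads one phd record. It assumes each line between BEGIN_DNA and END_DNA holds a base, a quality value and a peak position, as in the first four base lines of the listing; `parse_phd` is a hypothetical helper, not a tool from the Phred package.

```python
def parse_phd(text):
    """Parse one phd record into (name, comments, calls).

    calls is a list of (base, quality, peak_position) tuples.
    """
    name = None
    comments = {}
    calls = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split(None, 1)[1]
        elif line == "BEGIN_COMMENT":
            section = "comment"
        elif line == "BEGIN_DNA":
            section = "dna"
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif line == "END_SEQUENCE":
            break
        elif section == "comment":
            # Split only at the first colon so time stamps stay intact.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "dna":
            base, quality, position = line.split()[:3]
            calls.append((base, int(quality), int(position)))
    return name, comments, calls


example = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
CALL_METHOD: phred
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""
name, comments, calls = parse_phd(example)
```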

c t g a t g t g a c a g c t c t c a c t c t a g a g g c t g t t g c a g a g g t c c g c g a t t c c t t g c a g c t g c a t a c t a c a

t c g t a t g c c c c a c c a g g g a g a t t c g g a c c g g a c a g t a a t c g a a t t c c c g c g g c c g c c a t g g c g g c c g g g a g c a t

g c g a c g t c g g g c c c a a t t c g c c c t a t a g t g a g t c g t a t t a c a t t c a c t g g c c g c g t t t t t t a c a t a a a g g

Conversion of phd files into FASTA files: the phd2fasta script
Features:
- Phred creates single-sequence files containing the sequence itself plus the quality assignments (phd files)
- The input file for the cross_match and phrap programs is a multiple sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two multiple sequence FASTA format files, containing the sequence information and the basecall quality information respectively
- The phredPhrap script automatically executes phd2fasta before running cross_match and phrap!
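The conversion described above can be sketched in Python. This is not the phd2fasta Perl script itself, only an illustration of the idea: several reads go in, and two parallel multi-sequence files come out. The function name, the input structure and the line width are assumptions, and the real header lines may carry more fields.

```python
def to_fasta_and_qual(reads, width=50):
    """Build FASTA and quality file contents from base calls.

    reads maps a read name to a list of (base, quality) pairs.
    Returns (fasta_text, qual_text) with matching headers and order.
    """
    fasta_lines, qual_lines = [], []
    for name, calls in reads.items():
        bases = "".join(base for base, _ in calls)
        quals = [str(quality) for _, quality in calls]
        fasta_lines.append(">" + name)
        qual_lines.append(">" + name)
        # Wrap both files at the same positions so they stay parallel.
        for i in range(0, len(calls), width):
            fasta_lines.append(bases[i:i + width])
            qual_lines.append(" ".join(quals[i:i + width]))
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"


reads = {"r1": [("t", 8), ("c", 13), ("a", 19), ("c", 19)]}
fasta_text, qual_text = to_fasta_and_qual(reads)
```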

Vector screening
Features:
- This step removes or screens out vector sequence before running phrap
- Program: cross_match, a program for rapid sequence comparison and database search based on an efficient implementation of the Smith-Waterman-Gotoh algorithm
- Command: cross_match seq_file1 [seq_file2...] [-option value] [-option value]
- seq_file is a file containing sequences in FASTA format
- all sequences in seq_file1 (query) are compared to sequences in seq_file2 (subject)
- matches meeting the relevant criteria are written to the standard output

Vector screening
Example:
cross_match seqfile.fasta vector.seq -minmatch 10 -minscore 20 -screen > screen.out
where:
- 'seqfile.fasta' is a file containing multiple reads in FASTA format
- 'vector.seq' is a file containing the vector sequences
- '-minmatch' and '-minscore' are parameters for pairwise alignment
- '-screen' creates a file named seqfile.fasta.screen containing vector-masked versions of the original sequences. Any region matching any part of a vector sequence is replaced by Xs.
- 'screen.out' contains a list of the matches found
- the '.screen' file is the input for phrap
- if a '.qual' file was created (i.e. seqfile.fasta.qual), it has to be renamed to seqfile.fasta.screen.qual. The phredPhrap script automatically performs this step!
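The masking effect of the screen option can be illustrated with a short Python sketch. It does not reproduce how cross_match finds matches; the match coordinates are supplied by hand, and `mask_intervals` is a hypothetical helper.

```python
def mask_intervals(sequence, intervals, mask_char="X"):
    """Replace each 1-based inclusive (start, end) interval with mask_char.

    Intervals that run past either end of the sequence are clipped.
    """
    masked = list(sequence)
    for start, end in intervals:
        for i in range(max(start, 1) - 1, min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


# Assumed vector matches at both ends of a 10-base read.
screened = mask_intervals("ACGTACGTAC", [(1, 3), (8, 12)])
```

Masking with Xs rather than trimming keeps the read length unchanged, so the quality values in the renamed .qual file still line up with the bases.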

Phrap - Phragment Assembly Program or... Phil's Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data
Command: phrap seq_file1 [seq_file2...] [-option value] [-option value]
- seq_file is a file containing multiple sequences in FASTA format
- the current version only handles a single sequence file
- all the sequences in seq_file are compared to each other

Phrap
Key Features:
a. Uses the entire read content: no need for trimming.
b. Combines user-supplied data (e.g. Repbase) with internally computed data: better assembly accuracy in the presence of repeats.
c. The contig sequence is a mosaic of the highest quality parts of the reads; it is not a consensus!
e. Handles very large datasets: hundreds of thousands of reads are easily manipulated.
f. Generates output files that contain some important data and enable visualization by other programs.

Phrap output files
- *.contigs: FASTA file containing the contigs
  - Contigs with more than one read
  - Singletons (single reads with a match to some other contig that could not be merged consistently with it)
- *.singlets: FASTA file of the singlet reads
  - Reads with no match to any other read
- *.ace: allows viewing the assembly with Consed

Consed Genome Research 8: 195-202, 1998

Consed
Consed is a program for viewing and editing assemblies produced by Phrap
Key Features:
a. Assembly viewer: allows visualization of contigs, the assembly (aligned reads), quality values of reads and the final sequence.
b. Trace file viewer: single and multiple trace files can be visualized, allowing comparison of a given sequence in several reads.
c. Navigation: identifies and lists regions that are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
d. Autofinish: an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
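The navigation feature, listing regions below a given quality threshold, can be illustrated with a short Python sketch. This is not Consed code; the function name and the default threshold of 20 are assumptions.

```python
def low_quality_regions(qualities, threshold=20):
    """Return 0-based half-open (start, end) runs where quality < threshold."""
    regions = []
    start = None
    for i, quality in enumerate(qualities):
        if quality < threshold:
            if start is None:
                start = i  # a new low-quality run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        # The run extends to the end of the sequence.
        regions.append((start, len(qualities)))
    return regions


regions = low_quality_regions([40, 40, 10, 15, 40, 5])
```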


Autofinish Genome Research 11: 614-625, 2001

Autofinish
Features:
- Autofinish is part of the Consed package.
- It automatically chooses finishing reads in order to finish a project.
- The "finished" status is defined by the user according to pre-defined parameters.

Autofinish
Autofinish allows the user to:
- Figure out how contigs are ordered and oriented
- Close gaps
- Improve the error rate
- Cover every base by reads from at least 2 different subclones
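The last criterion, coverage of every base by at least 2 different subclones, can be sketched as follows. This is only an illustration of the check, not the Autofinish algorithm; the read layout (start, end, subclone) and the function name are assumptions.

```python
def weakly_covered_positions(contig_length, reads, min_subclones=2):
    """List contig positions covered by fewer than min_subclones subclones.

    reads is an iterable of (start, end, subclone) with 0-based,
    half-open coordinates on the contig.
    """
    subclones_at = [set() for _ in range(contig_length)]
    for start, end, subclone in reads:
        for pos in range(max(start, 0), min(end, contig_length)):
            subclones_at[pos].add(subclone)
    return [pos for pos, found in enumerate(subclones_at)
            if len(found) < min_subclones]


# Two reads from subclone "a" and one from subclone "b" on a 6-base contig.
weak = weakly_covered_positions(6, [(0, 4, "a"), (2, 6, "b"), (0, 2, "a")])
```

Two reads from the same subclone count once, which is why positions 0 and 1 are still reported here.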

Autofinish
Autofinish will suggest any of the following types of reads:
- Forward universal primer terminator reads
- Reverse universal primer terminator reads
- Custom primer reads with subclone template
- Custom primer reads with whole clone template
- Minilibraries
- PCR

Autofinish
Finishing procedure (diagram labels): Shotgun reads; Assemble new reads with existing reads; Autofinish suggests reads; Make reads in lab

How to get the programs
Supported platforms:
- Solaris
- Linux computers (i686, i386, EM64T, AMD64)
- Mac (OS X)
Note: there are commercial versions of Phred/Phrap for the DOS/Windows platform (no Consed version so far)
Internet site (academic version):

Contacts
To obtain the programs, and for questions, bug reports and suggestions:
- Phrap/Cross_match/Swat: Phil Green, phg@u.washington.edu
- Phred: Brent Ewing
- Consed: David Gordon

The Staden Package
Medical Research Council, Laboratory of Molecular Biology (MRC-LMB), UK (no longer supported by the original team; now open source)
pregap4: preparing sequence trace data for assembly
- Graphical user interface
- Prepare trace data
- Automation
- Trace format conversion
- Quality analysis
- Vector clipping
- Contaminant screening
- Repeat searching

The Staden Package
Assembly programs: Gap4 and Gap5
- Assembly
- Contig joining
- Assembly checking
- Repeat searching
- Experiment suggestion
- Read pair analysis
- Contig editing
- Graphical views of contigs
- Database
Note: ace files produced by a special version of Phrap can be viewed by Gap4


The Staden Package
Supported platforms:
- Sun Solaris
- Compaq Tru64 UNIX (Alpha)
- SGI Irix
- Linux
- MS Windows (Win9x, NT, 2000)
Availability: http://staden.sourceforge.net/

CAP3 - Sequence Assembly Program Genome Research 9: 868-877, 1999

CAP3 - Sequence Assembly Program
Characteristics:
- Makes use of quality values: qual files produced by Phred can be used by CAP3
- Produces an ace file compatible with Consed
- Can also be used in Gap4 (Staden Package)
- Program available at

Finishing Problems
Finishing can be a tedious and difficult task due to:
DNA sequencing problems
a. High GC content: genomes with a high GC content are more prone to generate artifacts such as compressions, sudden drops and bad quality regions. Try to use Dye Primer instead of Dye Terminator, change chemistry, add DMSO, increase the annealing temperature, use deaza-dGTP instead of dGTP, etc.
b. Palindromic regions: lead to strong secondary structures causing sudden drops. Try to use deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions: can reduce DNA synthesis efficiency for some chemistries. Try to use Dye Primer instead of Dye Terminator, or change chemistry (dRhodamine instead of BigDye).

Finishing Problems
Finishing can be a tedious and difficult task due to:
DNA assembly problems
a. High repeat content: highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen it with Cross_match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units.
b. High AT content: some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and associate the mapping with the DNA sequencing data.
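Base composition bias of the kind mentioned for high GC and high AT genomes can be estimated directly from a sequence. The Python sketch below is only an illustration; the function name is an assumption, and ambiguous or masked symbols (N, X) are left out of the count.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the A, C, G, T bases.

    Symbols other than A, C, G, T are ignored. Returns 0.0 when no
    countable base is present.
    """
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


balanced = gc_content("ATGC")
at_rich = gc_content("aattnn")
```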
