DNA Assembly Sanger Reads


DNA Assembly Sanger Reads Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP

Why assemble a genome?
- Current DNA sequencing methods generate reads of 500-700 bp, the resolution limit of electrophoresis
- Whole genomes or large clones need to be fragmented into a clone library
- Short fragments are randomly sequenced (shotgun approach), and the reads are assembled to form the final consensus sequence

Shotgun Sequencing I – random phase
Figure labels: BAC clone: 100-200 kb; Sheared DNA: 1.0-2.0 kb; Sequencing Templates; Random Reads. Modified from BCM-HGSC.

Shotgun Sequencing II – assembly
Figure labels: Low Base Quality; Single Stranded Region; Mis-Assembly (Inverted); Sequence Gap; Consensus. Modified from BCM-HGSC.

Shotgun Sequencing III – finishing
Figure sequence (modified from BCM-HGSC): the labels Low Base Quality, Single Stranded Region, Mis-Assembly (Inverted) and Sequence Gap disappear one at a time across successive slides until only the Consensus remains.
High Accuracy Sequence: < 1 error/10,000 bases

How do we deal with the enormous number of reads generated by high-throughput DNA sequencers?
Sanger Institute

Sanger Institute - Hinxton - UK (photographs):
- ABI3700 DNA sequencers
- ABI3730 DNA sequencers
- Colony-picking robots
- Plasmid miniprep robots
- Plasmid miniprep room
- Thermocycler room


Exponential growth of sequence generation
Genetic Sequence Data Bank - October 15 2012
NCBI-GenBank Flat File Release 192.0 Distribution Release Notes: 157,889,737 loci, 145,430,961,262 bases …from 157,889,737 reported sequences

Phred/Phrap/Consed Package
Phred/Phrap/Consed is a widely distributed package for:
a. Reading trace files (chromatograms);
b. Assigning a quality (confidence) value to each individual base;
c. Identifying and masking vector and repeat sequences;
d. Assembling sequences and assigning an error probability to the consensus sequence;
e. Viewing and editing assemblies;
f. Automatic finishing.

Phred/Phrap/Consed Pipeline
Directories: chromat_dir, phd_dir, edit_dir

Phred
Genome Research 8: 175-185, 1998
Genome Research 8: 186-194, 1998

Phred
Phred is a program that performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: attributes a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases: a “Phred value” based on an error rate estimate calculated for each individual base.
d. Creates output files: base calls and quality values are written to output files.

Trace File: high quality read
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Trace File: good quality read
- no ambiguities (Ns)
- some noise (notice the baseline)
- peaks very well spaced

Trace File: poor quality read
- some ambiguities (Ns)
- bad noise (notice the baseline)
- overlapping peaks
- can be caused by a bad quality template, a bad matrix or a low signal-to-noise ratio

Trace File: poor quality read
- many ambiguities (Ns)
- noise
- caused by a homopolymeric region/polymerase slippage

Trace File: sudden drop artifact
- a good quality region is followed by a sudden drop of signal
- caused by secondary structure

Trace File: high quality region
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Trace File: medium quality region
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved

Trace File: poor quality region (diffusion effects and a decrease in the relative mass difference between the sequencing products)
- overlapping peaks, peaks not evenly spaced
- low resolution
- low confidence in base assignment

Phred Analysis steps
a) Predicts idealized (expected) peaks (amplitudes) based effectively on the best region of the trace
b) Identifies observed peaks
c) Compares observed and expected peaks (divides the peaks into matched and unmatched)
d) Unmatched peaks are analyzed for any peak that could be called, but was not called in step c
Modified from Evan Eichler, Ph.D

Phred value formula
$q = -10 \times \log_{10}(p)$
where
q = quality value
p = estimated error probability of a base call
Examples:
q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
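The relationship between the Phred value and the error probability can be checked numerically. The following is a minimal sketch; the function names `phred_to_error_prob` and `error_prob_to_phred` are illustrative and are not part of the Phred program.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value q into an estimated error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability p into a Phred quality value q."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Reproduce the two examples from the slide.
for q in (20, 40):
    p = phred_to_error_prob(q)
    print(f"q = {q}: p = {p:g} (about 1 error in {round(1 / p):,} bases)")
```

Because the scale is logarithmic, every increase of 10 in q corresponds to a tenfold decrease in the estimated error probability.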

The structure of a phd file
Each line between BEGIN_DNA and END_DNA gives the base call, its quality value and the peak position in the trace.
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS:99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA
END_SEQUENCE
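A phd file with this layout can be read with a few lines of code. The sketch below assumes a single record per file and the keywords shown on the slide; `parse_phd` is a hypothetical helper written for illustration, not a function from the Phred/Phrap/Consed package.

```python
def parse_phd(text: str):
    """Parse one phd record into (name, comments, calls).

    calls is a list of (base, quality, peak_position) tuples.
    """
    name = None
    comments = {}
    calls = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split(maxsplit=1)[1]
        elif line == "BEGIN_COMMENT":
            section = "comment"
        elif line == "BEGIN_DNA":
            section = "dna"
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif line == "END_SEQUENCE":
            break
        elif section == "comment":
            # Split on the first colon only, so time stamps stay intact.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "dna":
            base, quality, position = line.split()[:3]
            calls.append((base, int(quality), int(position)))
    return name, comments, calls


example = """\
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
CALL_METHOD: phred
TIME: Thu May 24 00:18:58 2001
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""

name, comments, calls = parse_phd(example)
print(name, comments["CALL_METHOD"], calls[:2])
```

Empty comment fields such as TRIM are stored as empty strings, so the parser does not need to know which fields Phred filled in.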

phd file excerpt (base call, quality value, peak position):
c 57 1778 t 57 1792 g 57 1805 a 57 1820 t 57 1828 g 57 1841 t 57 1853 g 57 1867 a 68 1880 c 68 1889 a 68 1902 g 68 1915 c 68 1927 t 68 1941 c 68 1954 t 68 1967 c 68 1979 a 68 1991 c 68 2000 t 68 2014 c 57 2028 t 57 2040 a 57 2053 g 57 2063 a 41 2079 g 57 2087 g 57 2100 c 57 2112 t 59 2125 g 54 2138 t 57 2149 t 57 2162 g 57 2176 c 57 2186 a 57 2199 g 57 2212 a 57 2228 g 57 2237 g 57 2250 t 57 2263 c 57 2274 c 57 2287 g 57 2302 c 57 2311 g 57 2326 a 57 2341 t 57 2350 t 57 2364 c 68 2375 c 68 2388 t 68 2400 t 68 2414 g 68 2427 c 68 2439 a 68 2451 g 68 2462 c 68 2474 t 68 2488 g 68 2501 c 68 2511 a 68 2523 t 68 2535 a 68 2548 c 68 2559 t 68 2572 a 68 2584 c 68 2596 a 68 2609

phd file excerpt (base call, quality value, peak position):
t 28 6526 c 31 6539 g 32 6552 t 35 6562 a 35 6574 t 39 6585 g 47 6597 c 43 6608 c 41 6621 c 32 6632 c 31 6645 a 37 6655 c 21 6664 c 18 6678 a 9 6688 g 9 6708 g 9 6712 g 9 6721 a 18 6734 g 37 6745 a 36 6758 t 37 6767 t 37 6779 c 32 6792 g 22 6804 g 20 6816 a 23 6829 c 23 6837 c 24 6852 g 22 6863 g 22 6875 a 25 6889 c 25 6897 a 24 6908 g 31 6919 t 34 6932 a 37 6941 a 37 6952 t 41 6964 c 39 6976 g 39 6988 a 28 6997 a 21 7008 t 15 7017 t 15 7027 c 12 7034 c 13 7049 c 14 7062 g 32 7078 c 20 7090 g 18 7101 g 10 7112 c 9 7121 c 9 7137 g 9 7149 c 9 7156 c 9 7171 a 18 7182 t 25 7192 g 37 7204 g 39 7214 c 36 7228 g 36 7238 g 31 7249 c 22 7262 c 22 7276 g 22 7288 g 20 7296 g 20 7311 a 19 7324 g 21 7333 c 15 7344 a 16 7353 t 15 7366

phd file excerpt (base call, quality value, peak position):
g 25 7377 c 22 7389 g 26 7402 a 16 7414 c 24 7423 g 15 7437 t 28 7450 c 19 7459 g 19 7475 g 19 7484 g 16 7491 c 19 7506 c 19 7520 c 32 7530 a 34 7540 a 37 7552 t 31 7562 t 26 7575 c 27 7586 g 27 7599 c 23 7607 c 26 7620 c 26 7631 t 30 7642 a 30 7653 t 15 7663 a 12 7674 g 11 7687 t 12 7698 g 12 7708 a 26 7720 g 21 7730 t 34 7743 c 34 7755 g 37 7766 t 37 7777 a 32 7787 t 16 7797 t 10 7809 a 8 7817 c 8 7828 a 8 7847 t 22 7860 t 19 7872 c 30 7881 a 37 7889 c 37 7900 t 25 7912 g 24 7923 g 22 7935 c 13 7942 c 13 7953 g 10 7963 c 12 7979 g 8 7988 t 8 8002 t 8 8019 t 12 8023 t 8 8034 t 6 8050 t 6 8061 a 6 8066 c 8 8086 a 6 8092 t 6 8107 a 7 8117 a 8 8126 a 8 8131 g 8 8145 g 8 8153


Conversion of phd files into FASTA files: phd2fasta script
Features:
- Phred creates single-sequence files containing the sequence itself plus the quality assignments (phd files)
- The input file for the cross_match and phrap programs is a multiple-sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two multiple-sequence FASTA format files, containing the sequence information and the base-call quality information, respectively
- The phredPhrap script automatically executes phd2fasta before running cross_match and phrap!
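The idea behind the conversion can be illustrated with a short sketch. It is not the phd2fasta script itself (which is written in Perl and writes richer header lines); the function `records_to_fasta_and_qual`, its `width` parameter and the read name `read1` are assumptions made for illustration.

```python
def records_to_fasta_and_qual(records, width=50):
    """Build FASTA and quality-file text from (name, calls) records.

    Each call is a (base, quality) pair, in read order.
    Returns a (fasta_text, qual_text) tuple.
    """
    fasta_lines, qual_lines = [], []
    for name, calls in records:
        bases = "".join(base for base, _quality in calls)
        quals = [str(quality) for _base, quality in calls]
        # Both files use the same header so entries can be paired by name.
        fasta_lines.append(f">{name}")
        qual_lines.append(f">{name}")
        for start in range(0, len(calls), width):
            fasta_lines.append(bases[start:start + width])
            qual_lines.append(" ".join(quals[start:start + width]))
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"


fasta_text, qual_text = records_to_fasta_and_qual(
    [("read1", [("t", 8), ("c", 13), ("a", 19), ("c", 19)])]
)
print(fasta_text)
print(qual_text)
```

Keeping the sequence and quality entries in the same order and under the same name is what lets downstream programs match each base to its quality value.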


Vector screening
Features: this step removes or screens out vector sequence before running phrap
Program: cross_match, a program for rapid sequence comparison and database search based on an efficient implementation of the Smith-Waterman-Gotoh algorithm.
Command: cross_match seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing sequences in FASTA format
- all sequences in seq_file1 (query) are compared to the sequences in seq_file2 (subject)
- matches meeting relevant criteria are written to the standard output
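To give a feeling for the kind of comparison involved, here is a minimal sketch of Smith-Waterman local alignment scoring. It is a simplification and not cross_match's implementation: it uses a linear gap penalty rather than Gotoh's affine gaps, it fills the whole matrix, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions.

```python
def smith_waterman_score(a: str, b: str, match=1, mismatch=-2, gap=-3) -> int:
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# A read sharing a short stretch with a hypothetical vector sequence.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

A high local score between a read and a vector sequence indicates that part of the read is vector-derived, which is the signal vector screening relies on.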

Vector screening
Example: cross_match seqfile.fasta vector.seq -minmatch 10 -minscore 20 -screen > screen.out
where:
- ‘seqfile.fasta’ is a file containing multiple reads in FASTA format
- ‘vector.seq’ is a file containing the vector sequences
- ‘-minmatch’ and ‘-minscore’ are parameters for pairwise alignment
- ‘-screen’ creates a file named seqfile.fasta.screen containing vector-masked versions of the original sequences. Any region matching any part of a vector sequence is replaced by Xs.
- ‘screen.out’ contains a list of the matches found
- the ‘.screen’ file is the input for phrap
- if a ‘.qual’ file was created (i.e. seqfile.fasta.qual), it has to be renamed (seqfile.fasta.screen.qual); the phredPhrap script automatically performs this step!
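The masking described for ‘-screen’ can be pictured with a small sketch. The helper `mask_regions` and the coordinates used are hypothetical; in practice the matching regions come from cross_match's own alignments.

```python
def mask_regions(sequence: str, regions) -> str:
    """Replace each 1-based, inclusive (start, end) region with X characters."""
    masked = list(sequence)
    for start, end in regions:
        # Clamp the region to the sequence so out-of-range matches are safe.
        for index in range(max(start, 1) - 1, min(end, len(masked))):
            masked[index] = "X"
    return "".join(masked)


# Mask a hypothetical vector match covering the first four bases of a read.
print(mask_regions("ACGTACGTAC", [(1, 4)]))
```

Masking keeps the read length and the base positions unchanged, so the quality values in the renamed ‘.qual’ file still line up with the masked sequence.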


Phrap - Phragment Assembly Program or… Phil’s Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data
Command: phrap seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing multiple sequences in FASTA format
- the current version only handles a single sequence file
- all the sequences in seq_file are compared to each other

Phrap
Key Features:
a. Uses the entire read content: no need for trimming.
b. Uses user-supplied (e.g. Repbase) and internally computed data: better accuracy of assembly in the presence of repeats.
c. The contig sequence is a mosaic of the highest quality parts of the reads: it’s not a consensus!

Phrap
Key Features:
e. Handles very large datasets: hundreds of thousands of reads are easily manipulated.
f. Generates output files: these contain some important data and enable visualization by other programs.
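The difference between a quality-based mosaic and a majority-vote consensus can be shown with a toy example. This is only a sketch of the idea and not Phrap's algorithm: it chooses the best base column by column, whereas Phrap selects longer high-quality read segments. The function name and the input layout (pre-aligned reads padded with "-") are assumptions.

```python
def highest_quality_mosaic(aligned_reads):
    """Build a sequence from the highest-quality base in each alignment column.

    aligned_reads is a list of (bases, qualities) pairs of equal length,
    where "-" marks columns that a read does not cover.
    """
    length = len(aligned_reads[0][0])
    mosaic = []
    for column in range(length):
        candidates = [
            (qualities[column], bases[column])
            for bases, qualities in aligned_reads
            if bases[column] != "-"
        ]
        if candidates:
            _best_quality, best_base = max(candidates, key=lambda c: c[0])
            mosaic.append(best_base)
        else:
            mosaic.append("n")  # no read covers this column
    return "".join(mosaic)


reads = [
    ("ACGT-", [30, 30, 10, 10, 0]),
    ("-CTTA", [0, 20, 40, 40, 35]),
]
print(highest_quality_mosaic(reads))
```

In the third column the two reads disagree (G with quality 10 versus T with quality 40), and the higher-quality call is kept rather than being decided by a vote.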

Phrap output files
- *.contigs: FASTA file containing the contigs
  - Contigs with more than one read
  - Singletons (single reads with a match to some other contig but that couldn’t be merged consistently with it)
- *.singlets: FASTA file of the singlet reads
  - Reads with no match to any other read
- *.ace: allows the assembly to be viewed using Consed


Consed
Genome Research 8: 195-202, 1998

Consed
Consed is a program for viewing and editing assemblies produced by Phrap
Key Features:
a. Assembly viewer: allows visualization of contigs, the assembly (aligned reads), quality values of reads and the final sequence.
b. Trace file viewer: single and multiple trace files can be visualized, allowing comparison of a given sequence in several reads.
c. Navigation: identifies and lists regions that are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
d. Autofinish: automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
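Listing regions below a quality threshold is simple to express in code. The sketch below only illustrates the idea and is not how Consed is implemented; `low_quality_regions` and the example values are hypothetical.

```python
def low_quality_regions(qualities, threshold):
    """Return 1-based, inclusive (start, end) runs where quality < threshold."""
    regions = []
    start = None
    for position, quality in enumerate(qualities, start=1):
        if quality < threshold:
            if start is None:
                start = position  # a new low-quality run begins here
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:
        regions.append((start, len(qualities)))
    return regions


consensus_qualities = [40, 40, 10, 12, 40, 8]
print(low_quality_regions(consensus_qualities, threshold=20))
```

Each reported interval is a candidate for inspection in the trace viewer or for an additional finishing read.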



Autofinish
Genome Research 11: 614-625, 2001

Autofinish
Features:
- Autofinish is part of the Consed package.
- It automatically chooses finishing reads in order to finish a project.
- The “finished” status is defined by the user according to pre-defined parameters.

Autofinish
Autofinish allows the user to:
- Figure out how contigs are ordered and oriented
- Close gaps
- Improve the error rate
- Cover every base by reads from at least 2 different subclones
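The last goal, covering every base with reads from at least two different subclones, can be checked with a simple coverage count. This is an illustrative sketch under assumed inputs, not Autofinish's own logic; the function name and the read tuples are hypothetical.

```python
def weakly_covered_positions(contig_length, reads, minimum_subclones=2):
    """Return 1-based positions covered by fewer than minimum_subclones subclones.

    reads is a list of (subclone, start, end) tuples with 1-based,
    inclusive coordinates on the contig.
    """
    subclones_at = [set() for _ in range(contig_length)]
    for subclone, start, end in reads:
        for index in range(max(start, 1) - 1, min(end, contig_length)):
            subclones_at[index].add(subclone)
    return [
        index + 1
        for index, subclones in enumerate(subclones_at)
        if len(subclones) < minimum_subclones
    ]


reads = [("a", 1, 4), ("b", 3, 6), ("a", 2, 5)]
print(weakly_covered_positions(6, reads))
```

Two reads from the same subclone count only once, which is why a set of subclone names is kept for each position instead of a plain read count.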

Autofinish
Autofinish will suggest any of the following types of reads:
- Forward universal primer terminator reads
- Reverse universal primer terminator reads
- Custom primer reads with subclone template
- Custom primer reads with whole clone template
- Minilibraries
- PCR

Autofinish
Finishing procedure (cycle):
1. Shotgun reads
2. Assemble new reads with existing reads
3. Autofinish suggests reads
4. Make reads in lab, then return to step 2

How to get the programs
Supported platforms:
- Solaris
- Linux computers (i686, i386, EM64T, AMD64)
- Mac (OS X)
Note: there are commercial versions of Phred/Phrap for DOS/Windows platforms (no Consed version so far)
Internet site: http://www.phrap.org/phredphrapconsed.html - academic version

Contacts
To obtain the programs, or for questions, bug reports and suggestions:
- Phrap/Cross_match/Swat – Phil Green – phg@u.washington.edu
- Phred – Brent Ewing – bge@u.washington.edu
- Consed – David Gordon – gordon@genome.washington.edu

The Staden Package
Medical Research Council – Laboratory of Molecular Biology (MRC-LMB) – UK (no longer supported by the original team; now open source)
pregap4: preparing sequence trace data for analysis and assembly
- Graphical user interface
- Prepare trace data
- Automation
- Trace format conversion
- Quality analysis
- Vector clipping
- Contaminant screening
- Repeat searching

The Staden Package
Assembly programs: Gap4 and Gap5
- Assembly
- Contig joining
- Assembly checking
- Repeat searching
- Experiment suggestion
- Read pair analysis
- Contig editing
- Graphical views of contigs
- Database
Note: ace files produced by a special version of Phrap can be viewed by Gap4


The Staden Package
Supported platforms:
- Sun Solaris
- Compaq Tru64 UNIX (Alpha)
- SGI Irix
- Linux
- MS Windows (Win9x, NT, 2000)
Availability:
- http://staden.sourceforge.net/

CAP3 - Sequence Assembly Program
Genome Research 9: 868-877, 1999
Characteristics:
- Makes use of quality values: qual files produced by Phred can be used by CAP3
- Produces an ace file compatible with Consed
- Can also be used in Gap4 (Staden Package)
- Program available at http://seq.cs.iastate.edu/

Finishing Problems
Finishing can be a tedious and difficult task due to:
DNA sequencing problems
a. High GC content: genomes with a high GC content are more prone to generate artifacts such as compressions, sudden drops and bad quality regions. Try using Dye Primer instead of Dye Terminator, changing the chemistry, adding DMSO, increasing the annealing temperature, using deaza-dGTP instead of dGTP, etc.
b. Palindromic regions: lead to strong secondary structures that cause sudden drops. Try using deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions: can reduce DNA synthesis efficiency for some chemistries. Try using Dye Primer instead of Dye Terminator, or change the chemistry (dRhodamine instead of BigDye).

Finishing Problems
Finishing can be a tedious and difficult task due to:
DNA assembly problems
a. High repeat content: highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen it with Cross_match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units.
b. High AT content: some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and associate the mapping with the DNA sequencing data.
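Base composition is easy to measure before choosing a sequencing or assembly strategy. The following is a minimal sketch; `gc_fraction` is a hypothetical helper, and ignoring ambiguous characters such as N or masked X positions is a design assumption.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous bases (A, C, G, T)."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        return 0.0
    return sum(base in "GC" for base in bases) / len(bases)


for name, seq in (("GC-rich example", "GGCCGCGCAT"), ("AT-rich example", "ATATTTAAGC")):
    print(f"{name}: GC = {gc_fraction(seq):.0%}")
```

Running the same calculation over a sliding window along a contig can help locate the regions where compressions or assembly difficulties are most likely.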