DNA Assembly Sanger Reads


1 DNA Assembly Sanger Reads
Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP

2 Why assemble a genome?
- Current DNA sequencing methods generate reads of bp – resolution limit of electrophoresis
- Whole genomes or large clones need to be fragmented – clone library
- Short fragments are randomly sequenced (shotgun approach) – reads are assembled to form the final consensus sequence

3 Shotgun Sequencing I – random phase
Sheared DNA: kb BAC clone: kb Random Reads Sequencing Templates Modified from BCM-HGSC AG-ICB-USP 3

4 Shotgun Sequencing II - assembly
Low Base Quality Single Stranded Region Mis-Assembly (Inverted) Sequence Gap Consensus Modified from BCM-HGSC AG-ICB-USP 4

5 Shotgun Sequencing III - finishing
Low Base Quality Single Stranded Region Mis-Assembly (Inverted) Sequence Gap Consensus Modified from BCM-HGSC AG-ICB-USP 5

6 Shotgun Sequencing III - finishing
Single Stranded Region Mis-Assembly (Inverted) Sequence Gap Consensus Modified from BCM-HGSC AG-ICB-USP 6

7 Shotgun Sequencing III - finishing
Mis-Assembly (Inverted) Sequence Gap Consensus Modified from BCM-HGSC AG-ICB-USP 7

8 Shotgun Sequencing III - finishing
Sequence Gap Consensus Modified from BCM-HGSC AG-ICB-USP 8

9 Shotgun Sequencing III - finishing
Consensus High Accuracy Sequence: < 1 error/ 10,000 bases Modified from BCM-HGSC AG-ICB-USP 9

10 How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?
Sanger Institute AG-ICB-USP

11 Sanger Institute - Hinxton - UK
ABI3700 DNA sequencers AG-ICB-USP

12 Sanger Institute - Hinxton - UK
ABI3730 DNA sequencers AG-ICB-USP

13 Sanger Institute - Hinxton - UK
ABI3700 DNA sequencers AG-ICB-USP

14 Sanger Institute - Hinxton - UK
Colony-picking robots AG-ICB-USP

15 Sanger Institute - Hinxton - UK
Colony-picking robot AG-ICB-USP

16 Sanger Institute - Hinxton - UK
Plasmid miniprep robots AG-ICB-USP

17 Sanger Institute - Hinxton - UK
Plasmid miniprep room AG-ICB-USP

18 Sanger Institute - Hinxton - UK
Thermocycler room AG-ICB-USP

19 Exponential growth of sequence generation
AG-ICB-USP

20 Exponential growth of sequence generation
AG-ICB-USP

21 Exponential growth of sequence generation
AG-ICB-USP

22 Exponential growth of sequence generation
AG-ICB-USP

23 Exponential growth of sequence generation
Genetic Sequence Data Bank - October NCBI-GenBank Flat File Release 192.0 Distribution Release Notes: loci, bases …from reported sequences AG-ICB-USP

24 Phred/Phrap/Consed Package
Phred/Phrap/Consed is a package distributed worldwide for:
a. Reading trace files (chromatograms);
b. Assigning a quality (confidence) value to each individual base;
c. Identifying and masking vector and repeat sequences;
d. Assembling sequences and assigning an error probability to the consensus sequence;
e. Viewing and editing the assembly;
f. Automatic finishing.

25 Phred/Phrap/Consed Pipeline
Directories: chromat_dir phd_dir edit_dir AG-ICB-USP
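The three directories named on this slide can be set up with a few lines of Python. This is only a minimal sketch: the helper name `make_project_dirs` is an illustrative assumption, and the role comments reflect the usual Consed convention rather than anything stated on the slide.

```python
from pathlib import Path

# Sub-directory names taken from the slide; the comments describe their usual role.
PIPELINE_DIRS = (
    "chromat_dir",  # raw trace (chromatogram) files
    "phd_dir",      # phd files written by Phred
    "edit_dir",     # FASTA/qual files and Phrap output (e.g. the .ace file)
)


def make_project_dirs(project_root):
    """Create the three pipeline sub-directories under project_root.

    Hypothetical helper; returns the list of created paths in slide order.
    """
    root = Path(project_root)
    created = []
    for name in PIPELINE_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(path)
    return created
```

Calling `make_project_dirs("my_project")` (the project name is a placeholder) would leave an empty layout ready to receive trace files in `chromat_dir`.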

26 Phred Genome Research 8: 175-185, 1998
AG-ICB-USP

27 Phred Genome Research 8: 186-194, 1998
AG-ICB-USP

28 Phred
Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.

29 Phred
c. Assigns quality values to the bases – a “Phred value” based on an error rate estimated for each individual base.
d. Creates output files – base calls and quality values are written to output files.

30 Trace File High quality read: - no ambiguities (Ns) - no noise
- peaks very well spaced AG-ICB-USP

31 Trace File Good quality read: - no ambiguities (Ns)
- some noise (notice baseline) - peaks very well spaced AG-ICB-USP

32 Trace File
Poor quality read:
- some ambiguities (Ns)
- bad noise (notice baseline)
- overlapping peaks
- can be caused by a poor-quality template, a bad matrix or a low signal-to-noise ratio

33 Trace File Poor quality read: - many ambiguities (Ns) - noise
- caused by homopolymeric region/polymerase slippage AG-ICB-USP

34 Trace File Sudden drop artifact: - good quality region is followed by a sudden drop of signal - caused by secondary structure AG-ICB-USP

35 Trace File High quality region: - no ambiguities (Ns) - no noise
- peaks very well spaced AG-ICB-USP

36 Trace File
Medium quality region:
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved

37 Trace File Poor quality region - diffusion effects and decrease in the relative mass difference between the sequence products: - overlapping peaks, peaks not evenly spaced - low resolution - low confidence to base assignment AG-ICB-USP

38 Phred
Analysis steps:
a) Predicts idealized (expected) peaks (amplitudes), based effectively on the best region of the trace
b) Identifies observed peaks
c) Compares observed and expected peaks (divides the peaks into matched and unmatched)
d) Unmatched peaks are analyzed for any peak that could be called, but was not called in step c
Modified from Evan Eichler, Ph.D

39 Phred value formula
$q = -10 \times \log_{10}(p)$
where
q - quality value
p - estimated error probability for a base call
Examples:
q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
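The formula can be checked numerically with a short Python sketch. The function names are illustrative choices, not part of Phred; `expected_errors` is an extra convenience that simply sums the per-base error probabilities of a read.

```python
import math


def phred_to_error_prob(q):
    """Error probability p for a Phred quality value q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality value q for an error probability p: q = -10 * log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


def expected_errors(quals):
    """Expected number of miscalled bases, given per-base quality values."""
    return sum(phred_to_error_prob(q) for q in quals)


if __name__ == "__main__":
    for q in (20, 40):
        print(f"q = {q} -> p = {phred_to_error_prob(q):g}")
```

Read the other way round, the finishing target of fewer than 1 error in 10,000 bases corresponds to a consensus quality of 40 or better.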

40 The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
t a a a g c c t g g g g t g c c t a a t g t g t c g n c t t c t c c c t c g g a g g t a g t c a g c n c n n n n n n n c n
END_DNA
END_SEQUENCE
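A phd record like the one above can be read with a small parser. The sketch below assumes the layout visible on the slide: a `BEGIN_COMMENT` block of `KEY: value` lines and a `BEGIN_DNA` block in which each line holds a base, its quality value and its peak position. The function name `parse_phd` and the read name `read1` in the sample are hypothetical; the four base calls in the sample are the ones legible on the slide.

```python
def parse_phd(text):
    """Parse one phd record into (name, comments, calls).

    calls is a list of (base, quality, peak_position) tuples taken from
    the BEGIN_DNA ... END_DNA block.
    """
    name = None
    comments = {}
    calls = []
    section = None  # None, "comment" or "dna"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(None, 1)
            name = parts[1] if len(parts) > 1 else ""
        elif line == "BEGIN_COMMENT":
            section = "comment"
        elif line == "BEGIN_DNA":
            section = "dna"
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif line == "END_SEQUENCE":
            break
        elif section == "comment":
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "dna":
            base, qual, pos = line.split()[:3]
            calls.append((base, int(qual), int(pos)))
    return name, comments, calls


# Tiny sample record; "read1" is a made-up read name.
SAMPLE = """BEGIN_SEQUENCE read1
BEGIN_COMMENT
CALL_METHOD: phred
TRACE_ARRAY_MAX_INDEX: 12153
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""

if __name__ == "__main__":
    name, comments, calls = parse_phd(SAMPLE)
    print(name, comments["CALL_METHOD"], calls[:2])
```

The parser handles a single record and does no validation beyond integer conversion, which is enough to inspect qualities read by read.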

41 AG-ICB-USP

42 c t g a t g t g a c a g c t c t c a c t c t a g a g g c t g t t g c a g a g g t c c g c g a t t c c t t g c a g c t g c a t a c t a c a AG-ICB-USP

43 AG-ICB-USP

44 t c g t a t g c c c c a c c a g g g a g a t t c g g a c c g g a c a g t a a t c g a a t t c c c g c g g c c g c c a t g g c g g c c g g g a g c a t AG-ICB-USP

45 AG-ICB-USP

46 g c g a c g t c g g g c c c a a t t c g c c c t a t a g t g a g t c g t a t t a c a t t c a c t g g c c g c g t t t t t t a c a t a a a g g AG-ICB-USP

47 Phred/Phrap/Consed Pipeline
Directories: chromat_dir phd_dir edit_dir AG-ICB-USP

48 Conversion of phd files into FASTA files phd2fasta script
Features:
- Phred creates single-sequence files containing the sequence itself plus the quality assignments (phd files)
- The input file for the cross_match and phrap programs is a multiple-sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two multiple-sequence FASTA-format files, containing the sequence information and the base-call quality information respectively
- The phredPhrap script automatically executes phd2fasta before running cross_match and phrap!
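The idea behind this conversion can be sketched in Python. This is not the phd2fasta script itself, only an illustration of the two outputs it is described as producing: a FASTA file of sequences and a parallel file of quality values with the same headers. The function name, the 50-column line width and the read names are assumptions.

```python
def to_fasta_and_qual(reads, width=50):
    """Build FASTA and quality text from (name, bases, quals) tuples.

    Returns (fasta_text, qual_text). The quality text uses the same
    header lines, followed by space-separated integer quality values.
    """
    fasta_lines, qual_lines = [], []
    for name, bases, quals in reads:
        if len(bases) != len(quals):
            raise ValueError(
                f"{name}: {len(bases)} bases but {len(quals)} quality values"
            )
        fasta_lines.append(f">{name}")
        qual_lines.append(f">{name}")
        # Wrap both files at the same positions so they stay parallel.
        for i in range(0, len(bases), width):
            fasta_lines.append(bases[i:i + width])
            qual_lines.append(" ".join(str(q) for q in quals[i:i + width]))
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"


if __name__ == "__main__":
    fasta, qual = to_fasta_and_qual([("read1", "tcac", [8, 13, 19, 19])])
    print(fasta, qual, sep="")
```

Keeping one quality value per base, in the same order, is what lets the assembler weigh each base call individually.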

49 Phred/Phrap/Consed Pipeline
Directories: chromat_dir phd_dir edit_dir AG-ICB-USP

50 Vector screening
Features:
- This step removes or screens out vector sequence before running phrap
- Program: Cross_match – a program for rapid sequence comparison and database search based on an efficient implementation of the Smith-Waterman-Gotoh algorithm
Command: cross_match seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing sequences in FASTA format
- all sequences in seq_file1 (query) are compared to sequences in seq_file2 (subject)
- matches meeting relevant criteria are written to the standard output
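For readers unfamiliar with the Smith-Waterman-Gotoh algorithm, the sketch below computes the best local alignment score between two sequences with affine gap penalties. It is a plain textbook version, not cross_match's implementation, and the scoring values are arbitrary placeholders rather than the program's defaults.

```python
def sw_gotoh_score(a, b, match=1, mismatch=-2, gap_open=-4, gap_extend=-3):
    """Best local alignment score of a and b with affine gaps.

    gap_open is the score of the first gap character and gap_extend the
    score of each further one (Gotoh's recurrences). Only two rows of
    the dynamic-programming matrices are kept in memory.
    """
    neg = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)    # H: best score of an alignment ending at (i-1, j)
    f_prev = [neg] * (m + 1)  # F: same, but ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg] * (m + 1)
        e = neg               # E: ending with a gap in a, along the current row
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


if __name__ == "__main__":
    print(sw_gotoh_score("GGACGTGG", "TTACGTTT"))
```

The zero floor in the recurrence is what makes the alignment local: a vector fragment buried inside a read still scores well even when the flanking sequence does not match.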

51 Vector screening
Example:
cross_match seqfile.fasta vector.seq -minmatch 10 -minscore 20 -screen > screen.out
where:
- ‘seqfile.fasta’ is a file containing multiple reads in FASTA format
- ‘vector.seq’ is a file containing the vector sequences
- ‘-minmatch’ and ‘-minscore’ are parameters for pairwise alignment
- ‘-screen’ creates a file named seqfile.fasta.screen containing vector-masked versions of the original sequences. Any region matching any part of a vector sequence is replaced by Xs.
- ‘screen.out’ contains a list of the matches found
- the ‘.screen’ file is the input for phrap
- if a ‘.qual’ file was created (i.e. seqfile.fasta.qual), it has to be renamed to seqfile.fasta.screen.qual – the phredPhrap script automatically performs this step!
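The effect of screening on a single read can be illustrated with a few lines of Python. This sketch assumes the matching regions are already known and only shows the masking step, in which every base inside a vector match is replaced by X; the function name and the coordinate convention (0-based, end exclusive) are choices made here, not cross_match's.

```python
def mask_regions(seq, regions, mask_char="X"):
    """Replace every base inside the given regions with mask_char.

    regions holds (start, end) pairs, 0-based with end exclusive.
    Regions may overlap and are clipped to the sequence length.
    """
    masked = list(seq)
    for start, end in regions:
        start, end = max(0, start), min(len(seq), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical read with vector matches at both ends.
    print(mask_regions("ACGTACGTAC", [(0, 3), (8, 10)]))
```

Masking keeps the read length unchanged, which is why the original quality file can be reused for the screened sequences after renaming.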

52 Phred/Phrap/Consed Pipeline
Directories: chromat_dir phd_dir edit_dir AG-ICB-USP

53 Phrap - Phragment Assembly Program or… Phil’s Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data
Command: phrap seq_file1 [seq_file2 ...] [-option value] [-option value] ...
- seq_file is a file containing multiple sequences in FASTA format
- the current version only handles a single sequence file
- all the sequences in the seq_file are compared to each other

54 Phrap
Key Features:
a. Uses the entire read content – no need for trimming.
b. Combines user-supplied data (e.g. Repbase) with internally computed data – better assembly accuracy in the presence of repeats.
c. The contig sequence is a mosaic of the highest quality parts of the reads – it’s not a consensus!

55 Phrap
Key Features:
e. Handles very large datasets – hundreds of thousands of reads are easily handled.
f. Generates output files – these contain important data and enable visualization by other programs.

56 Phrap output files
- *.contigs – FASTA file containing the contigs:
  - contigs with more than one read
  - singletons (single reads with a match to some other contig but that couldn’t be merged consistently with it)
- *.singlets – FASTA file of the singlet reads: reads with no match to any other read
- *.ace – allows for viewing the assembly using Consed

57 Phred/Phrap/Consed Pipeline
Directories: chromat_dir phd_dir edit_dir AG-ICB-USP

58 Consed Genome Research 8: 195-202, 1998
AG-ICB-USP

59 Consed
Consed is a program for viewing and editing assemblies produced by Phrap
Key Features:
a. Assembly viewer – allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.
b. Trace file viewer – single and multiple trace files can be visualized, allowing for comparison of a given sequence in several reads.

60 Consed
Consed is a program for viewing and editing assemblies produced by Phrap
Key Features:
c. Navigation – identifies and lists regions that are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
d. Autofinish – an automatic set of functions for: gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
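The first navigation task, listing regions below a given quality threshold, is easy to sketch in Python. This is an illustration of the idea only, not Consed's code; the function name, the default threshold of 20 and the coordinate convention are assumptions.

```python
def low_quality_regions(quals, threshold=20, min_length=1):
    """Return (start, end) runs where quality < threshold.

    Coordinates are 0-based with end exclusive; runs shorter than
    min_length are skipped.
    """
    regions = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i  # a low-quality run begins here
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(quals) - start >= min_length:
        regions.append((start, len(quals)))
    return regions


if __name__ == "__main__":
    consensus_quals = [40, 40, 10, 15, 40, 5, 40, 12]  # made-up values
    print(low_quality_regions(consensus_quals))
```

A list like this gives a finisher the positions to visit in the assembly viewer, one stretch at a time.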

70 Phred/Phrap/Consed Pipeline
Directories: chromat_dir phd_dir edit_dir AG-ICB-USP

71 Autofinish Genome Research 11: 614-625, 2001
AG-ICB-USP

72 Autofinish
Features:
- Autofinish is part of the Consed package.
- It automatically chooses finishing reads in order to finish a project.
- The “finished” status is defined by the user according to pre-defined parameters

73 Autofinish
Autofinish allows the user to:
- Figure out how contigs are ordered and oriented
- Close gaps
- Improve the error rate
- Cover every base by reads from at least 2 different subclones
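The last goal, every base covered by reads from at least 2 different subclones, can be checked with a simple coverage count. The sketch below is an illustration only and makes no claim about how Autofinish does it; the function name, the read tuple layout and the subclone labels are assumptions.

```python
def weakly_covered_positions(length, reads, min_subclones=2):
    """Positions covered by fewer than min_subclones distinct subclones.

    reads holds (start, end, subclone) tuples, 0-based with end
    exclusive; length is the contig length. Returns 0-based positions.
    """
    subclones_at = [set() for _ in range(length)]
    for start, end, subclone in reads:
        for pos in range(max(0, start), min(length, end)):
            subclones_at[pos].add(subclone)
    return [
        pos for pos, names in enumerate(subclones_at)
        if len(names) < min_subclones
    ]


if __name__ == "__main__":
    # Made-up placement of three reads from two subclones on a 10-base contig.
    placed = [(0, 6, "A"), (3, 10, "B"), (4, 8, "A")]
    print(weakly_covered_positions(10, placed))
```

Two reads from the same subclone count only once, which is the point of the rule: an error introduced during cloning would appear in both.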

74 Autofinish
Autofinish will suggest any of the following types of reads:
- Forward universal primer terminator reads
- Reverse universal primer terminator reads
- Custom primer reads with subclone template
- Custom primer reads with whole clone template
- Minilibraries
- PCR

75 Autofinish
Finishing procedure:
1. Shotgun reads
2. Assemble new reads with existing reads
3. Autofinish suggests reads
4. Make reads in lab
5. Return to step 2 and repeat until the project is finished

76 How to get the programs
Supported platforms:
- Solaris
- Linux computers (i686, i386, EM64T, AMD64)
- Mac (OS X)
Note: there are commercial versions of Phred/Phrap for the DOS/Windows platform (no Consed version so far)
Internet site: - academic version

77 Contacts
To obtain the programs, or for questions, bug reports and suggestions:
- Phrap/Cross_match/Swat – Phil Green – phg@u.washington.edu
- Phred – Brent Ewing
- Consed – David Gordon

78 The Staden Package
Medical Research Council – Laboratory of Molecular Biology (MRC-LMB) – UK (no longer supported by the original team) – now open source
pregap4: preparing sequence trace data for analysis and assembly
- Graphical user interface
- Prepare trace data
- Automation
- Trace format conversion
- Quality analysis
- Vector clipping
- Contaminant screening
- Repeat searching

79 The Staden Package
Assembly programs: Gap4 and Gap5
- Assembly
- Contig joining
- Assembly checking
- Repeat searching
- Experiment suggestion
- Read pair analysis
- Contig editing
- Graphical views of contigs
- Database
Note: ace files produced by a special version of Phrap can be viewed by Gap4

85 The Staden Package
Supported platforms:
- Sun Solaris
- Compaq Tru64 UNIX (Alpha)
- SGI Irix
- Linux
- MS Windows (Win9x, NT, 2000)
Availability: http://staden.sourceforge.net/

86 CAP3 - Sequence Assembly Program Genome Research 9: 868-877, 1999
AG-ICB-USP

87 CAP3 - Sequence Assembly Program
Characteristics:
- Makes use of quality values – qual files produced by Phred can be used by CAP3
- Produces an ace file compatible with Consed
- Can also be used in Gap4 (Staden Package)
- Program available at

88 Finishing Problems
Finishing can be a boring and difficult task due to:
DNA sequencing problems
a. High GC content – genomes with a high GC content are more prone to artifacts such as compressions, sudden drops and bad quality regions. Try to use Dye Primer instead of Dye Terminator, change chemistry, add DMSO, increase annealing temperature, use deaza-dGTP instead of dGTP, etc.
b. Palindromic regions – lead to strong secondary structures causing sudden drops. Try to use deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions – can reduce DNA synthesis efficiency for some chemistries. Try to use Dye Primer instead of Dye Terminator, or change chemistry (dRhodamine instead of BigDye).

89 Finishing Problems
Finishing can be a boring and difficult task due to:
DNA assembly problems
a. High repeat content – highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen for it with Cross_Match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units.
b. High AT content – some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and combine the mapping with the DNA sequencing data.

