Published by Bethanie Blake. Modified over 9 years ago.
1
Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick
2
Outline
  Input Data
  Sequence read data
  Pipeline Review
  Un-processed data
  Assemblers
  Preliminary data: assembler comparison
  Visualization
  Future
3
Input Data
V. navarrensis | V. vulnificus
2423-01 | 2009V-1368
08-2462 | 06-2432
2541-90 | 08-2435
2756-81 | 08-2439
- | 07-2444
Input Data / Sequence Read Data / Pipeline Review / Un-processed data / Assemblers / Preliminary Data / Visualization / Future
4
Vibrio navarrensis - 454
Sequence ID | 2423-01 | 08-2462 | 2541-90 | 2756-81
Min. Read Length | 21 bp | 25 bp | 19 bp | 28 bp
Max. Read Length (three values in source) | 738 bp, 573 bp, 704 bp
Avg. Read Length | 423.27 (± 117.36 bp) | 401.80 (± 117.12 bp) | 416.23 (± 125.84 bp) | 423.53 (± 117.19 bp)
Total Reads | 160,560 | 13,854 | 303,434 | 218,021
Coverage | 15x | 1.23x | 28.06x | 20.51x
5
Vibrio vulnificus - 454
Sequence ID | 2009V-1368 | 06-2432 | 08-2435 | 08-2439 | 07-2444
Min. Read Length | 26 bp | 21 bp | 23 bp | 22 bp | 18 bp
Max. Read Length | 593 bp | 597 bp | 723 bp | 594 bp | 736 bp
Avg. Read Length | 416.05 (± 123.19 bp) | 371.91 (± 112.13 bp) | 416.98 (± 121.56 bp) | 418.12 (± 120.88 bp) | 368.78 (± 115.96 bp)
Total Reads | 191,280 | 786,944 | 352,726 | 173,538 | 777,228
Coverage | 17x | 65x | 32x | 16x | 63x
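The coverage figures in the 454 tables are consistent with the usual estimate: coverage = total reads × average read length / genome size. The sketch below is only an illustration of that arithmetic. The genome size of 4.5 Mb is an assumption (the slides do not state the value that was used), and the function and variable names are hypothetical.

```python
def estimate_coverage(total_reads, avg_read_length_bp, genome_size_bp):
    """Expected fold coverage: total sequenced bases divided by genome size."""
    return total_reads * avg_read_length_bp / genome_size_bp


# ASSUMPTION: genome size of about 4.5 Mb; not stated on the slides.
ASSUMED_GENOME_SIZE_BP = 4_500_000

# (total reads, average read length in bp), copied from the 454 tables.
samples = {
    "nav 08-2462": (13_854, 401.80),
    "vul 2009V-1368": (191_280, 416.05),
    "vul 06-2432": (786_944, 371.91),
}

for name, (reads, avg_len) in samples.items():
    cov = estimate_coverage(reads, avg_len, ASSUMED_GENOME_SIZE_BP)
    print(f"{name}: {cov:.2f}x")
```

Under this assumed genome size the estimates land close to the tabulated 1.23x, 17x and 65x; a different genome size would shift every value proportionally.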
6
Vibrio navarrensis - Illumina
Sequence ID | 2423-01 | 08-2462 | 2541-90 | 2756-81
Min. Read Length | 76 bp (all samples)
Max. Read Length | 76 bp (all samples)
Avg. Read Length | 76 bp (all samples)
Total Reads | 19,316,659 | 29,414,237 | 126,298,691 | 92,338,634
Coverage | 326x | 496x | 250x | 237x
7
Vibrio vulnificus - Illumina
Sequence ID | 2009V-1368 | 06-2432 | 08-2435 | 08-2439 | 07-2444
Min. Read Length | 76 bp (all samples)
Max. Read Length | 76 bp (all samples)
Avg. Read Length | 76 bp (all samples)
Total Reads | 15,764,329 | 14,562,252 | 15,343,648 | 16,007,895 | 15,495,709
Coverage | ~250x (all samples)
8
[Pipeline diagram]
PRE-PROCESSING
  Input: 454 raw reads, Illumina raw reads
  Steps: pre-processing, statistical analysis (read stats)
  Tools: Fastqc, Prinseq, NGS QC
  Output: 454 reads, Illumina reads
REFERENCE SELECTION
  Input: published genomes from public databases (V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O)
  Steps: align Illumina reads against the reference, compare mapping statistics
  Tools: bwa, samstats
  Output: reference genome
DENOVO ASSEMBLY (Illumina / 454 / hybrid)
  Hybrid de novo: Ray, MIRA
  454 de novo: Newbler, CABOG, SUTTA
  Illumina de novo: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
  Output: contigs * 3
  Steps: align Illumina reads against 454 contigs (contigs, unmapped reads)
  Tools: MacVector, CLC wb
  Evaluation: GAGE, Hawk-eye
REFERENCE BASED ASSEMBLY (Illumina / (454?))
  Tools: AMOScmp
  Output: contigs, unmapped reads
  Reference evaluation: DNA Diff, MUMmer
  Parameter optimization
CONTIG MERGING
  Steps: all possible combinations of the best 3
  Tools: Minimus, MAIA, PAGIT, Mauve
  Output: scaffolds, draft/finished genome
  Evaluation: GAGE
GENOME FINISHING
  Steps: gap filling, nucleotide identity
  Tools: MUMmer, GRASS, built-in
  Output: finished genome
LEGEND: process; info.; chosen ref.; assemblers (Illumina, 454, hybrid); data (454, Illumina)
9
Vibrio vulnificus - 454
[Table: per-sample result for each metric; cell values not preserved]
Samples: 1368, 2432, 2435, 2439, 2444
Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content
10
Vibrio navarrensis - 454; unprocessed data
[Table: per-sample result for each metric; cell values not preserved]
Samples: 2423-01, 08-2462, 2541-90, 2756-81
Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content
11
Vibrio vulnificus - Illumina; unprocessed data
[Table: per-sample result for each metric; cell values not preserved]
Samples: 2009V-1368, 06-2432, 08-2435, 08-2439, 07-2444
Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content
12
Vibrio navarrensis - Illumina; unprocessed data
[Table: per-sample result for each metric; cell values not preserved]
Samples: 2423-01, 08-2462, 2541-90, 2756-81
Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content
13
Per base sequence quality
[Figure panels: vul_454_07-2444, nav_454_2541-90, vul_ill_06-2432, nav_ill_08-2462]
14
Per base sequence content
[Figure panels: vul_454_06-2432, vul_ill_06-2432, nav_ill_06-2756-81, nav_454_08-2462]
15
Seq. duplicate levels
[Figure panels: vul_454_08-2435, nav_454_2541-90, vul_ill_06-2432, nav_ill_08-2462]
16
Pre-processing stats
Parameter | Value
Total sequences | 15,343,648
Good sequences | 9,775,116
Bad sequences | 5,568,532
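As a quick sanity check on these counts, the good and bad sequences should add up to the total, and the retained fraction follows directly. This is a minimal sketch with hypothetical variable names; the three numbers are the ones on the slide.

```python
# Counts copied from the pre-processing stats slide.
total_sequences = 15_343_648
good_sequences = 9_775_116
bad_sequences = 5_568_532

# The two categories should partition the input.
counts_are_consistent = (good_sequences + bad_sequences == total_sequences)

# Fraction of reads that survive pre-processing.
retained_fraction = good_sequences / total_sequences

print(f"consistent: {counts_are_consistent}")
print(f"retained: {retained_fraction:.1%}")
```

Roughly 64% of the reads pass, so downstream coverage estimates based on raw read counts would overstate the usable depth by about a third.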
18
Assemblers
[Table columns: Name, Platform, Source file, Installation, Usage; only Name and Platform values preserved]
Name | Platform
Allpaths LG | Illumina
SOAP DeNovo | Illumina
Velvet | Illumina
SUTTA | Hybrid
RAY | Hybrid
CLC genomics workbench | Hybrid
Newbler | 454
CABOG | 454
19
CLC Genomics
Word Size: Automatic Word Size
  CLC bio's de novo assembly algorithm uses de Bruijn graphs. It builds a table of all sub-sequences of a certain length (called words) found in the reads.
Bubble Size: Automatic Bubble Size
  A bubble is a bifurcation in the graph where a path splits into two branches that then merge back into one.
Minimum Contig Length: 200
Mismatch cost: 2
  The cost of a mismatch between the read and the reference sequence.
Insertion cost: 3
  The cost of an insertion in the read (causing a gap in the reference sequence).
Deletion cost: 3
  The cost of having a gap in the read. The score for a match is always 1.
Length fraction: 0.5
  Minimum fraction of a read's length that must match the reference sequence. A value of 0.5 means that at least half the read must match the reference for the read to be included in the final mapping.
Similarity: 0.8
  Minimum fraction of identity between the read and the reference sequence. To require, for example, at least 90% identity for a read to be included in the final mapping, set this value to 0.9.
Update contigs based on mapped reads
  The original contig sequences produced by the de novo assembly are updated to reflect the mapping of the reads.
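The mapping parameters above can be illustrated with a small scoring sketch. This is one plausible reading of the settings, not CLC's implementation: the alignment is encoded as a hypothetical string of per-column operations ("=" match, "X" mismatch, "I" insertion in the read, "D" deletion from the read), and the way length fraction and similarity are computed here is an assumption.

```python
def alignment_score(ops, match=1, mismatch_cost=2, insertion_cost=3, deletion_cost=3):
    """Score an alignment: +1 per match, minus the cost of each mismatch or gap."""
    deltas = {"=": match, "X": -mismatch_cost, "I": -insertion_cost, "D": -deletion_cost}
    return sum(deltas[op] for op in ops)


def passes_filters(ops, read_length, length_fraction=0.5, similarity=0.8):
    """Apply the two inclusion thresholds (simplified, assumed interpretation)."""
    # Read bases that take part in the alignment (deletions consume no read base).
    aligned_read_bases = sum(1 for op in ops if op in "=XI")
    # Identity over all alignment columns.
    identity = ops.count("=") / len(ops)
    return (aligned_read_bases / read_length >= length_fraction
            and identity >= similarity)


# Hypothetical alignment: 45 matches, 3 mismatches, 1 insertion, 1 deletion.
ops = "=" * 45 + "X" * 3 + "I" + "D"
print(alignment_score(ops))                  # 45 - 6 - 3 - 3
print(passes_filters(ops, read_length=60))   # most of the read is aligned
print(passes_filters(ops, read_length=120))  # under half of the read is aligned
```

With the slide's settings, the same alignment is kept for a 60 bp read and rejected for a 120 bp read, because only the length-fraction threshold changes between the two cases.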
20
Velvet
De Bruijn graph assembler
Max k-mer length: 31; default: 29
Commands
  velveth <directory> <k-mer length> -<read type> -<file format> <filename>
  velvetg VAssemILL -exp_cov auto -cov_cutoff auto
  -exp_cov auto: let the system infer the expected coverage of unique regions
  -cov_cutoff auto: let the system infer the cutoff used to remove low-coverage nodes
Designed for very short reads (25-50 bp)
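To make the role of the k-mer length concrete, here is a minimal sketch of how a de Bruijn graph can be built from reads: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a simplified illustration, not Velvet's implementation; it ignores reverse complements, sequencing errors and coverage cutoffs, and the function names and toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Return (graph, kmer_counts); graph maps a (k-1)-mer to its successor (k-1)-mers."""
    graph = defaultdict(set)
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            counts[kmer] += 1                 # k-mer multiplicity approximates coverage
            graph[kmer[:-1]].add(kmer[1:])    # edge: prefix -> suffix
    return graph, counts


# Toy reads that overlap each other (illustrative only).
reads = ["ACGTAC", "CGTACG"]
graph, counts = build_de_bruijn(reads, k=3)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A larger k resolves more repeats but needs reads that overlap by at least k-1 bases, which is why the choice interacts with read length and coverage.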
21
Newbler
De novo OLC assembler
Uses k-mer based hashing
Command: runAssembly [ filename ]
Designed for longer reads (454)
22
SOAP DeNovo2
Short-read de novo assembler
Designed to assemble Illumina GAII short reads
Command
  SOAPdenovo-127mer all -s test.config -K 30 -R -p 4 -N 4600000 -o test_OP 1>ass.log 2>ass.err
Parameters specified
  Insert_size: 0 (single-end reads)
  Kmer_size: 23 (default)
  asm_flag: both contigs and scaffolds
23
Assembler comparison - 454
Tool | N50 | No. of contigs | Avg. contig length | No. of large contigs | Largest contig | Read usage %
CLC Genomics wb. | 93,536 | 363 | 13,107 | NA | | 99.32
Newbler | 194,540 | 142 | 33,550 | 94 | 777,156 | 98.9

Tool | N50 | No. of contigs | Avg. contig length | No. of large contigs | Largest contig | Read usage %
CLC Genomics wb. | 84,313 | 313 | 13,828 | NA | | 98.53
Newbler | 111,462 | 347 | 12,606 | 168 | 218,091 | 97.88

Panels: nav_454_2541-90, vul_454_06-2432
24
Assembler comparison - Illumina
Tool | N50 | No. of contigs | Avg. contig length | Read usage % | Largest contig | Median coverage depth
SOAP DeNovo | 1,077 | 28,760 | 184 | NA | |
Velvet | 17,408 | 1,402 | 3,072 | 99.26 | 58,246 | 92.09
CLC Genomics wb | 56,628 | 291 | 14,766 | 99.36 | 193,565 | NA

Tool | N50 | No. of contigs | Avg. contig length | Read usage % | Largest contig | Median coverage depth
SOAP DeNovo | 1,094 | 26,773 | 207 | NA | |
Velvet | 15,699 | 1,253 | 3,759 | 99.57 | 51,343 | 86.93
CLC Genomics wb | 87,298 | 260 | 18,087 | 99.40 | 233,510 | NA

Panels: nav_ill_2541-90, vul_ill_06-2432
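N50, the headline metric in these comparisons, is commonly defined as the contig length L such that contigs of length at least L together contain at least half of the assembled bases. The sketch below implements that common definition; individual tools may differ (for example by ignoring contigs below a minimum length), and the contig lengths used here are made up for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical assemblies with the same total size (100 bases each).
fragmented = [10] * 10
contiguous = [40, 30, 20, 10]

print(n50(fragmented))
print(n50(contiguous))
```

Two assemblies of identical total length can have very different N50 values, which is why the tables report it alongside the number of contigs and the largest contig.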
25
[Pipeline diagram]
PRE-PROCESSING
  Input: 454 raw reads, Illumina raw reads
  Steps: pre-processing, statistical analysis (read stats)
  Tools: Fastqc, Prinseq, NGS QC
  Output: 454 reads, Illumina reads
REFERENCE SELECTION
  Input: published genomes from public databases (V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O)
  Steps: align Illumina reads against the reference, compare mapping statistics
  Tools: bwa, samstats
  Output: reference genome
DENOVO ASSEMBLY (Illumina / 454 / hybrid)
  Hybrid de novo: Ray
  454 de novo: Newbler, CABOG, SUTTA
  Illumina de novo: Allpaths LG, SOAP DeNovo, Velvet, SUTTA
  Output: contigs * 3
  Steps: align Illumina reads against 454 contigs (contigs, unmapped reads)
  Tools: MacVector, CLC wb
  Evaluation: GAGE, Hawk-eye
REFERENCE BASED ASSEMBLY (Illumina / 454?)
  Tools: AMOScmp
  Output: contigs, unmapped reads
  Reference evaluation: DNA Diff
  Parameter optimization
CONTIG MERGING
  Steps: all possible combinations of the best 3
  Tools: Minimus, MAIA, PAGIT, Mauve
  Output: scaffolds, draft/finished genome
  Evaluation: GAGE
GENOME FINISHING
  Steps: gap filling, nucleotide identity
  Tools: MUMmer, GRASS, built-in
  Output: finished genome
LEGEND: process; info.; chosen ref.; assemblers (Illumina, 454, hybrid); data (454, Illumina)
26
Reference Genomes
  V. vulnificus MO6-24/O
  V. vulnificus YJ016
  V. vulnificus CMCP6
27
Reference vs. all contigs - 454
Tool / Reference | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
CLC Genomics wb. (n = 313) | 45 | 25 | 41 | 25 | 39 | 25
Newbler (n = 347) | 59 | 25 | 58 | 25 | 43 | 24

Panels: nav_454_2541-90, vul_454_06-2432

Tool / Reference | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
CLC Genomics wb. | NA
Newbler (n = 142) | 85 | 91 | 84 | 91 | 86 | 92
28
Reference vs. all contigs - Illumina
Tool / Reference | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
SOAP DeNovo (n = 28,760) | 313-3143 (values run together in source)
Velvet (n = 1,402) | 20 | 23 | 20 | 23 | 20 | 23

Panels: nav_ill_2541-90, vul_ill_06-2432

Tool / Reference | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
SOAP DeNovo (n = 26,773) | 18 | 76 | 18 | 76 | 18 | 76
Velvet (n = 1,253) | 46 | 91 | 47 | 91 | 47 | 91
29
Visualization
30
Road ahead
  Get all the tools working
  Optimize tool parameters
  Use Illumina reads to finish 454 contigs
  Performance considerations for the tool
31
Questions???