Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick Genome Assembly.


1 Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick

2 Outline
- Input Data
- Sequence read data
- Pipeline Review
- Un-processed data
- Assemblers
- Preliminary data: assembler comparison
- Visualization
- Future

3 Input Data
V. navarrensis | V. vulnificus
2423-01 | 2009V-1368
08-2462 | 06-2432
2541-90 | 08-2435
2756-81 | 08-2439
- | 07-2444
Input Data / Sequence Read Data / Pipeline Review / Un-processed data / Assemblers / Preliminary Data / Visualization / Future

4 Vibrio navarrensis - 454
Sequence ID | 2423-01 | 08-2462 | 2541-90 | 2756-81
Min. Read Length | 21 bp | 25 bp | 19 bp | 28 bp
Max. Read Length | 738 bp, 573 bp, 704 bp (only three values appear in the transcript)
Avg. Read Length | 423.27 (± 117.36 bp) | 401.80 (± 117.12 bp) | 416.23 (± 125.84 bp) | 423.53 (± 117.19 bp)
Total Reads | 160,560 | 13,854 | 303,434 | 218,021
Coverage | 15x | 1.23x | 28.06x | 20.51x

5 Vibrio vulnificus - 454
Sequence ID | 2009V-1368 | 06-2432 | 08-2435 | 08-2439 | 07-2444
Min. Read Length | 26 bp | 21 bp | 23 bp | 22 bp | 18 bp
Max. Read Length | 593 bp | 597 bp | 723 bp | 594 bp | 736 bp
Avg. Read Length | 416.05 (± 123.19 bp) | 371.91 (± 112.13 bp) | 416.98 (± 121.56 bp) | 418.12 (± 120.88 bp) | 368.78 (± 115.96 bp)
Total Reads | 191,280 | 786,944 | 352,726 | 173,538 | 777,228
Coverage | 17x | 65x | 32x | 16x | 63x

6 Vibrio navarrensis - Illumina
Sequence ID | 2423-01 | 08-2462 | 2541-90 | 2756-81
Min. Read Length | 76 bp (all samples)
Max. Read Length | 76 bp (all samples)
Avg. Read Length | 76 bp (all samples)
Total Reads | 19,316,659 | 29,414,237 | 126,298,691 | 92,338,634
Coverage | 326x | 496x | 250x | 237x

7 Vibrio vulnificus - Illumina
Sequence ID | 2009V-1368 | 06-2432 | 08-2435 | 08-2439 | 07-2444
Min. Read Length | 76 bp (all samples)
Max. Read Length | 76 bp (all samples)
Avg. Read Length | 76 bp (all samples)
Total Reads | 15,764,329 | 14,562,252 | 15,343,648 | 16,007,895 | 15,495,709
Coverage | ~250x (all samples)
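
The coverage rows in the read tables are consistent with the usual estimate, total sequenced bases divided by genome size. The sketch below illustrates that arithmetic for two samples from slides 4 and 6. The 4.5 Mb genome size is an assumption (the slides do not state the value used), and `estimated_coverage` is a hypothetical helper name.

```python
def estimated_coverage(total_reads: int, avg_read_length: float, genome_size: int) -> float:
    """Expected sequencing depth: total sequenced bases / genome size."""
    return total_reads * avg_read_length / genome_size


# ASSUMPTION: a 4.5 Mb genome; this value is not given on the slides.
GENOME_SIZE = 4_500_000

# (total reads, average read length in bp), copied from slides 4 and 6.
samples = {
    "nav 454 2541-90": (303_434, 416.23),
    "nav Illumina 2423-01": (19_316_659, 76),
}

for name, (reads, avg_len) in samples.items():
    depth = estimated_coverage(reads, avg_len, GENOME_SIZE)
    print(f"{name}: {depth:.2f}x")
```

Under that assumed genome size, the two results land close to the 28.06x and 326x reported on the slides.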

8 Pipeline overview (flowchart)
PRE-PROCESSING
- Inputs: 454 raw reads, Illumina raw reads
- Tools: Fastqc, Prinseq, NGS QC
- Outputs: pre-processed 454 reads and Illumina reads; statistical analysis (read stats)
REFERENCE SELECTION
- Published genomes from public databases: V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O
- Align Illumina reads against the reference (bwa); compare mapping statistics (samstats)
- Output: reference genome
DENOVO ASSEMBLY (Illumina / 454 / hybrid)
- 454 de novo: Newbler, CABOG, SUTTA
- Illumina de novo: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
- Hybrid de novo: Ray, MIRA
- Also shown: Mac vector, CLC wb; align Illumina reads against 454 contigs; contigs * 3; unmapped reads
- Evaluation: GAGE, Hawk-eye
REFERENCE BASED ASSEMBLY (Illumina/(454?))
- AMOScmp; outputs contigs and unmapped reads
- Reference evaluation: DNA Diff, MUMmer
- Parameter optimization
CONTIG MERGING
- All possible combinations of the best 3
- Tools: Minimus, MAIA, PAGIT, Mauve
GENOME FINISHING
- Gap filling; nucleotide identity; MUMmer; GRASS; scaffolds; GAGE
- Output: draft/finished genome
LEGEND: process, built-in, info., chosen ref., assemblers (Illumina, 454, hybrid)

9 Vibrio vulnificus - 454
Table of quality-control results per sample (result cells not captured in the transcript)
Samples: 1368, 2432, 2435, 2439, 2444
Metrics: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content

10 Vibrio navarrensis - 454; unprocessed data
Table of quality-control results per sample (result cells not captured in the transcript)
Samples: 2423-01, 08-2462, 2541-90, 2756-81
Metrics: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content

11 Vibrio vulnificus - Illumina; unprocessed data
Table of quality-control results per sample (result cells not captured in the transcript)
Samples: 2009V-1368, 06-2432, 08-2435, 08-2439, 07-2444
Metrics: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content

12 Vibrio navarrensis - Illumina; unprocessed data
Table of quality-control results per sample (result cells not captured in the transcript)
Samples: 2423-01, 08-2462, 2541-90, 2756-81
Metrics: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content

13 Per base sequence quality
Figure panels: vul_454_07-2444, nav_454_2541-90, vul_ill_06-2432, nav_ill_08-2462

14 Per base sequence content
Figure panels: vul_454_06-2432, vul_ill_06-2432, nav_ill_06-2756-81, nav_454_08-2462

15 Seq. duplicate levels
Figure panels: vul_454_08-2435, nav_454_2541-90, vul_ill_06-2432, nav_ill_08-2462

16 Pre-processing stats
Parameter | Value
Total sequences | 15,343,648
Good sequences | 9,775,116
Bad sequences | 5,568,532
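
As a quick check on the pre-processing numbers, the sketch below confirms that good and bad sequences add up to the total and expresses each as a percentage. `filtering_summary` is a hypothetical helper name; only the three counts come from the slide.

```python
def filtering_summary(total: int, good: int, bad: int) -> dict:
    """Check that good + bad == total and report each share as a percentage."""
    if good + bad != total:
        raise ValueError("good and bad sequences do not add up to the total")
    return {
        "good_pct": 100 * good / total,
        "bad_pct": 100 * bad / total,
    }


# Counts copied from the pre-processing stats slide.
summary = filtering_summary(total=15_343_648, good=9_775_116, bad=5_568_532)
print(f"kept {summary['good_pct']:.1f}% of reads, discarded {summary['bad_pct']:.1f}%")
```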

17 Pipeline overview (flowchart repeated from slide 8)

18 Assemblers
Columns: Name | Platform | Source file | Installation | Usage (the last three columns are empty in the transcript)
Name | Platform
Allpaths LG | Illumina
SOAP DeNovo | Illumina
Velvet | Illumina
SUTTA | Hybrid
RAY | Hybrid
CLC genomics workbench | Hybrid
Newbler | 454
CABOG | 454

19 CLC Genomics
- Word Size: Automatic Word Size
  CLC bio's de novo assembly algorithm uses de Bruijn graphs. It builds a table of all sub-sequences of a certain length (called words) found in the reads.
- Bubble Size: Automatic Bubble Size
  A bubble is a bifurcation in the graph where a path splits into two branches that then merge back into one.
- Minimum Contig Length: 200
- Mismatch cost: 2
  The cost of a mismatch between the read and the reference sequence.
- Insertion cost: 3
  The cost of an insertion in the read (causing a gap in the reference sequence).
- Deletion cost: 3
  The cost of having a gap in the read. The score for a match is always 1.
- Length fraction: 0.5
  The minimum fraction of a read's length that must match the reference sequence. A value of 0.5 means that at least half the read must match the reference for the read to be included in the final mapping.
- Similarity: 0.8
  The minimum fraction of identity between the read and the reference sequence. For example, to require at least 90% identity for a read to be included in the final mapping, set this value to 0.9.
- Update contigs based on mapped reads
  The original contig sequences produced by the de novo assembly are updated to reflect the mapping of the reads.
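
To make the mapping parameters concrete, here is a minimal sketch of how the costs and the two thresholds might be applied to a single read alignment. It is an interpretation of the slide text, not CLC's actual implementation: the `ReadAlignment` class and the way length fraction and identity are computed from the counts are assumptions.

```python
from dataclasses import dataclass

# Parameter values taken from the slide.
MATCH_SCORE = 1
MISMATCH_COST = 2
INSERTION_COST = 3
DELETION_COST = 3
LENGTH_FRACTION = 0.5
SIMILARITY = 0.8


@dataclass
class ReadAlignment:
    """Hypothetical summary of one read aligned to a reference or contig."""
    read_length: int
    matches: int
    mismatches: int
    insertions: int  # read bases with no counterpart in the reference
    deletions: int   # reference bases with no counterpart in the read

    def score(self) -> int:
        """Alignment score: +1 per match, minus the configured costs."""
        return (MATCH_SCORE * self.matches
                - MISMATCH_COST * self.mismatches
                - INSERTION_COST * self.insertions
                - DELETION_COST * self.deletions)

    def passes_filters(self) -> bool:
        """ASSUMED reading of the thresholds: enough of the read is aligned,
        and enough of the aligned columns are identical."""
        aligned_read_bases = self.matches + self.mismatches + self.insertions
        columns = aligned_read_bases + self.deletions
        if columns == 0:
            return False
        length_ok = aligned_read_bases >= LENGTH_FRACTION * self.read_length
        identity_ok = self.matches / columns >= SIMILARITY
        return length_ok and identity_ok


good = ReadAlignment(read_length=76, matches=60, mismatches=4, insertions=1, deletions=1)
short = ReadAlignment(read_length=76, matches=30, mismatches=2, insertions=0, deletions=0)
print(good.score(), good.passes_filters())
print(short.score(), short.passes_filters())
```

The second read is rejected only because fewer than half of its 76 bases align, even though its identity is high.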

20 Velvet
- De Bruijn graph assembler
- Max k-mer length 31, default 29
- Commands
  - velveth <directory> <k-mer length> -<read type> -<file format> <filename>
  - velvetg VAssemILL -exp_cov auto -cov_cutoff auto
  - exp_cov auto: lets the system infer the expected coverage of unique regions
  - cov_cutoff auto: lets the system infer the coverage cutoff used to remove low-coverage nodes
- Designed for very short reads (25-50 bp)
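
The two Velvet steps can be sketched as a small Python helper that builds the command lines without running them. The input file name `reads.fastq`, the `-short` read type, and the `-fastq` format are assumptions chosen for single-end Illumina reads; the directory name and the k-mer length of 29 come from the slide.

```python
import shlex


def velvet_commands(directory: str, kmer: int, reads: str,
                    file_format: str = "-fastq", read_type: str = "-short"):
    """Build the velveth (hashing) and velvetg (graph/assembly) command lines."""
    velveth = ["velveth", directory, str(kmer), file_format, read_type, reads]
    velvetg = ["velvetg", directory, "-exp_cov", "auto", "-cov_cutoff", "auto"]
    return velveth, velvetg


# "reads.fastq" is a hypothetical input file.
for command in velvet_commands("VAssemILL", 29, "reads.fastq"):
    print(shlex.join(command))
```

Keeping each command as an argument list means it can later be passed to `subprocess.run` unchanged once Velvet is installed.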

21 Newbler
- De novo OLC (overlap-layout-consensus) assembler
- Uses k-mer based hashing
- Command: runAssembly [ filename ]
- Designed for longer reads (454)

22 SOAP DeNovo2
- Short-read de novo assembler
- Designed to assemble Illumina GA II short reads
- Command: SOAPdenovo-127mer all -s test.config -K 30 -R -p 4 -N 4600000 -o test_OP 1>ass.log 2>ass.err
- Parameters specified:
  - Insert_size: 0 (single-end reads)
  - Kmer_size: 23 (default)
  - asm_flag: both contigs and scaffolds
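
The command above reads its library settings from `test.config`, whose contents are not shown on the slide. A minimal sketch of what such a file might contain for one single-end library follows. The keys are standard SOAPdenovo configuration keys, but the values other than the insert size of 0 and "both contigs and scaffolds" (`asm_flags=3`) are assumptions, as is the hypothetical input file `reads.fastq`.

```python
def soap_config(reads_path: str, max_read_length: int = 76, asm_flags: int = 3) -> str:
    """Return the text of a SOAPdenovo config for one single-end FASTQ library."""
    lines = [
        f"max_rd_len={max_read_length}",  # ASSUMED: 76 bp, the Illumina read length
        "[LIB]",
        "avg_ins=0",                      # insert size 0: single-end reads
        "reverse_seq=0",
        f"asm_flags={asm_flags}",         # 3 = use reads for contigs and scaffolds
        f"q={reads_path}",                # single-end FASTQ input
    ]
    return "\n".join(lines) + "\n"


# "reads.fastq" is a hypothetical input file.
print(soap_config("reads.fastq"), end="")
```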

23 Assembler comparison - 454
Samples: nav_454_2541-90, vul_454_06-2432
Tool | N50 | No. of contigs | Avg. contig length | No. of large contigs | Largest contig | Read usage %
CLC Genomics wb. | 93,536 | 363 | 13,107 | NA | | 99.32
Newbler | 194,540 | 142 | 33,550 | 94 | 777,156 | 98.9

Tool | N50 | No. of contigs | Avg. contig length | No. of large contigs | Largest contig | Read usage %
CLC Genomics wb. | 84,313 | 313 | 13,828 | NA | | 98.53
Newbler | 111,462 | 347 | 12,606 | 168 | 218,091 | 97.88
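
For readers unfamiliar with the N50 column: it is the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembly size. A minimal sketch using made-up contig lengths (not data from the slides) is shown here.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative length, taken from the
    longest contig downwards, first reaches half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths is empty")


# Illustrative lengths only.
contigs = [80, 70, 50, 40, 30, 20]
print("N50:", n50(contigs))
print("No. of contigs:", len(contigs))
print("Avg. contig length:", sum(contigs) / len(contigs))
print("Largest contig:", max(contigs))
```

Because N50 depends on total assembly size, it is best compared between assemblies of the same sample, as the tables do.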

24 Assembler comparison - Illumina
Samples: nav_ill_2541-90, vul_ill_06-2432
Tool | N50 | No. of contigs | Avg. contig length | Read usage % | Largest contig | Median coverage depth
SOAP DeNovo | 1,077 | 28,760 | 184 | NA | |
Velvet | 17,408 | 1,402 | 3,072 | 99.26 | 58,246 | 92.09
CLC Genomics wb | 56,628 | 291 | 14,766 | 99.36 | 193,565 | NA

Tool | N50 | No. of contigs | Avg. contig length | Read usage % | Largest contig | Median coverage depth
SOAP DeNovo | 1,094 | 26,773 | 207 | NA | |
Velvet | 15,699 | 1,253 | 3,759 | 99.57 | 51,343 | 86.93
CLC Genomics wb | 87,298 | 260 | 18,087 | 99.40 | 233,510 | NA

25 Pipeline overview (flowchart repeated from slide 8; this version omits MIRA, Taipan, and MUMmer under reference evaluation)

26 Reference Genomes
- V. vulnificus MO6-24/O
- V. vulnificus YJ016
- V. vulnificus CMCP6

27 Reference vs. all contigs - 454
Samples: nav_454_2541-90, vul_454_06-2432
Tool | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
CLC Genomics wb. (n = 313) | 45 | 25 | 41 | 25 | 39 | 25
Newbler (n = 347) | 59 | 25 | 58 | 25 | 43 | 24

Tool | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
CLC Genomics wb. | NA
Newbler (n = 142) | 85 | 91 | 84 | 91 | 86 | 92

28 Reference vs. all contigs - Illumina
Samples: nav_ill_2541-90, vul_ill_06-2432
Tool | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
SOAP DeNovo (n = 28,760) | 313-3143 (values run together in the transcript)
Velvet (n = 1,402) | 20 | 23 | 20 | 23 | 20 | 23

Tool | CMCP6 aligned contigs % | CMCP6 aligned bases % | YJ016 aligned contigs % | YJ016 aligned bases % | MO6-24/O aligned contigs % | MO6-24/O aligned bases %
SOAP DeNovo (n = 26,773) | 18 | 76 | 18 | 76 | 18 | 76
Velvet (n = 1,253) | 46 | 91 | 47 | 91 | 47 | 91
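
The two percentages in the reference comparison tables can be read as "share of contigs with any alignment" and "share of contig bases covered by alignments". The slides do not define them, so the sketch below is one plausible reading; `alignment_summary` and the example numbers are hypothetical.

```python
def alignment_summary(contigs):
    """contigs: list of (contig_length, aligned_bases) pairs for one assembly.

    Returns (aligned contigs %, aligned bases %), under the ASSUMPTION that a
    contig counts as aligned when at least one of its bases aligns.
    """
    if not contigs:
        raise ValueError("no contigs given")
    total_bases = sum(length for length, _ in contigs)
    aligned_contigs = sum(1 for _, aligned in contigs if aligned > 0)
    aligned_bases = sum(aligned for _, aligned in contigs)
    return (100 * aligned_contigs / len(contigs),
            100 * aligned_bases / total_bases)


# Illustrative values only: four contigs, two of which align to the reference.
example = [(1000, 900), (500, 0), (250, 250), (250, 0)]
contigs_pct, bases_pct = alignment_summary(example)
print(f"aligned contigs: {contigs_pct:.1f}%, aligned bases: {bases_pct:.1f}%")
```

The two numbers can diverge sharply when many short contigs fail to align, which may explain rows such as SOAP DeNovo with 18% of contigs but 76% of bases aligned.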

29 Visualization

30 Road ahead
- Get all the tools working
- Optimize tool parameters
- Use Illumina reads to finish 454 contigs
- Performance considerations for the tools

31 Questions?

