Large Plant Genome Assemblies using Phusion2
Zemin Ning, The Wellcome Trust Sanger Institute


Phusion2 Assembly Pipeline
Input (NGS Data): Pair End Reads bp; Mate Pair Reads 2k-40k
Stages and tools:
- Filtering: Unikalow
- Clustering: Phusion2
- Assembly / Contig Generation: SOAPdenovo, Fermi, ABySS
- Contig Merge
- Scaffolding: Spinner
- Consensus Bases: Smalt & Gap5

iCAS – an Illumina Clone Assembly System
ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2

Data filtering using Unikalow
Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/

Assembly Method
Sequencing reads:
1 ACCTGATC
2 CTGATCAA
3 TGATCAAT
4 AGCGATCA
5 CGATCAAT
6 GATCAATG
7 TCAATGTG
8 CAATGTGA
Graph representations:
1. Overlap graph
2. de Bruijn graph
3. String graph
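As an illustration of the first two graph types, the sketch below builds an overlap graph and a de Bruijn graph from the eight reads on this slide. It is a minimal teaching example, not the code used by Phusion2 or any assembler in the pipeline; the function names, the minimum overlap of 3 and the kmer size of 4 are assumptions chosen for readability.

```python
from collections import defaultdict

# The eight reads shown on the slide, numbered 1..8.
READS = ["ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
         "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA"]


def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Map (i, j) read-number pairs to their suffix-prefix overlap length."""
    edges = {}
    for i, a in enumerate(reads, start=1):
        for j, b in enumerate(reads, start=1):
            if i != j:
                n = longest_overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges


def de_bruijn_graph(reads, k=4):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    for (i, j), n in sorted(overlap_graph(READS).items()):
        print(f"read {i} -> read {j}: overlap {n}")
    for node, successors in sorted(de_bruijn_graph(READS).items()):
        print(node, "->", sorted(successors))
```

A string graph would be obtained from the overlap graph by removing transitive edges (for example read 1 to read 3 when read 1 to read 2 and read 2 to read 3 already exist); that step is left out of this sketch.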

Scaffold Merge and Contig Merge
ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/
Diagram labels: Scaffold Merge: Ref; Contig Merge: Base, Sup, Ref, Base, Ctg

Contig Consensus using Gap5

Can we really trust Single Molecule Sequencing?
Panels: PacBio, Capillary, Illumina

Clone Assemblies vs Assemblers
5 BAC clones and 3 fosmids. Clone coverage: 99.7%; Base quality: Q39
Columns: Clone, Length, SOAP (N50*, Sub|Ind), ABySS (N50*, Sub|Ind), iCAS (N50*, Sub|Ind, Uncov)
bE217O | | |2 (2)**12
bT237K | | |4 (19)**626
bE352A | | |14 (65)**23
bE367M | | |1 (20)**1487
bE378K | | |1 (10)**741
fSS328I | | |00
fSS404B | | |00
fSY5K | | |00

Spinner – a scaffolding tool
Spinner uses mate pair data to scaffold contigs. Contigs, and the mate pairs that connect pairs of contigs, define a bidirectional graph. Using the expected insert size, an estimate of the gap size can be given for each contig.
ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/
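The gap estimate mentioned above can be sketched as follows. This is a generic mate-pair calculation under stated assumptions, not Spinner's actual implementation: each link is assumed to record how far the first read starts from the end of the left contig and how far the second read extends from the start of the right contig, and the median over all links is used to damp outliers. The function name and input layout are hypothetical.

```python
from statistics import median


def estimate_gap(insert_size, links):
    """Estimate the gap between two contigs joined by mate pairs.

    insert_size: expected insert size of the mate-pair library (bp).
    links: list of (dist_to_end_of_left_contig, dist_into_right_contig)
           tuples, one per mate pair that spans the two contigs.
    Each pair implies gap = insert_size - left part - right part;
    the median of these per-pair estimates is returned.
    A negative result suggests the contigs overlap.
    """
    if not links:
        raise ValueError("at least one linking mate pair is required")
    return median(insert_size - left - right for left, right in links)


if __name__ == "__main__":
    # Three hypothetical pairs from a 3 kb library.
    pairs = [(1000, 1500), (1200, 1400), (900, 1500)]
    print(estimate_gap(3000, pairs))
```

With real data the insert size is a distribution rather than a single number, so the estimate carries an uncertainty of roughly the library's insert-size spread divided by the square root of the number of links.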

Spinner – walks through a loop
These techniques alone produce useful results. Further stages will resolve repeats using pairs that “jump over” repeats, together with graph flow concepts.

Spinner vs SSPACE
Columns: Genome, Genome_Size, SSPACE (N50, Average), SPINNER (N50, Average)
Assemblathon    Mb    608Kb    86.8Kb    11Mb    450Kb
Grass Carp (F)    900Mb    2.3Mb    Mb    17.1Kb
Grass Carp (M)    1000Mb    0.34Mb    11.2Kb    2.27Mb    8.2Kb
Bamboo    2.0 Gb    322Kb    Kb    7689
Parrot    1.23 Gb    906Kb    Mb    6969

Grass Phylogeny

Bamboo Genome: Size Estimation
$G_s = (K_n - K_s)/D = 1.97 \times 10^{9}$
$K_n = 80.5 \times 10^{9}$: total number of kmer words
$K_s = 9.5 \times 10^{9}$: number of single copy kmer words
$D = 36$: depth of kmer occurrence
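The estimate above is simple arithmetic and can be checked with a few lines of code. The sketch assumes that the exponent of the single copy kmer count, which is truncated in the transcript, is 10^9; that is the value for which the slide's own numbers agree. The function name is illustrative.

```python
def estimate_genome_size(total_kmers, single_copy_kmers, depth):
    """Genome size estimate G_s = (K_n - K_s) / D.

    total_kmers: K_n, total number of kmer words in the reads.
    single_copy_kmers: K_s, number of kmer words seen only once
        (mostly sequencing errors, so they are subtracted).
    depth: D, depth of kmer occurrence (peak of the kmer histogram).
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return (total_kmers - single_copy_kmers) / depth


if __name__ == "__main__":
    # Bamboo values from the slide; K_s exponent assumed to be 1e9.
    size = estimate_genome_size(80.5e9, 9.5e9, 36)
    print(f"Estimated bamboo genome size: {size / 1e9:.2f} Gb")
```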

Bamboo Genome Assembly
Solexa reads:
Number of read pairs: 877 Million
Finished genome size: 2.0 Gb
Read length: 2x100bp
Estimated read coverage: ~90X
Insert size: 500/ bp
Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
Number of reads clustered: 757 Million
Assembly features (stats):
                                Contigs     Scaffolds
Total number of contigs:        744,        ,278
Total bases of contigs:         1.86 Gb     2.05 Gb
N50 contig size:                11,622      328,698
Largest contig:                 188,163     4,869,017
Averaged contig size:           2,500       7,400
Contig coverage on genome:      ~90%        >95%
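The quoted read coverage follows from the read counts on this slide: 877 million pairs of 2x100bp reads over a 2.0 Gb genome give about 88X, which the slide rounds to ~90X. A small sketch of that calculation, with a hypothetical function name:

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Average sequencing depth from paired reads.

    read_pairs: number of read pairs.
    read_length: length of each read in a pair (bp).
    genome_size: genome size (bp).
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    total_bases = read_pairs * 2 * read_length
    return total_bases / genome_size


if __name__ == "__main__":
    # Bamboo: 877 million pairs, 2x100bp, 2.0 Gb genome.
    print(f"{read_coverage(877e6, 100, 2.0e9):.1f}X")
```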

Bamboo Genome Assembly QC using Finished BACs
Columns: Assemblies by pure SOAPdenovo; Assemblies by SOAPdenovo & ABySS
Rows: Rate of single-base difference (# per Kb); Rate of insertion and deletion (# per Kb); Coverage by initial contigs; Coverage by supercontigs

Evolution of the Wheat Genome

Size of the Wheat Genome: 17Gb

International Wheat Genome Sequencing Consortium

Sequencing of D Genome: Libraries & Insert Sizes
Library            Insert size (bp)
WHEjyyDADDBAAPE    167
WHEjjzDADDCBAPE    199
WHEjjzDADDCCAPE    223
WHEjjzDADDCABPE    230
WHEjyyDAEDDAAPE    250
WHEjyyDAEDDABPE    250
WHEjyyDAEDDBAPE    250
WHEjyyDAEDDBBPE    250
WHEjyyDAEDDCAPE    250
WHEjyyDAEDDCBPE    250
WHEjyyDAEDDDAPE    250
WHEjjzDADDCACPE    254
WHEjyyDAEDIAAPE    500
WHEjyyDAEDIBAPE    500
WHEjyyDADDIAAPE    502
WHEjyyDADDIDAPE    510
WHEjyyDADDICAPE    527
WHEjyyDADDIBAPE    532
WHEjyyDADDIBBPE    551
WHEjyyDADDKAAPE    682
WHEjyyDADDMBAPE    706
WHEjyyDADDKCAPE    725
WHEjyyDADDMAAPE    764
WHEjyyDAADWAAPE    2000
WHEjyyDAADWBAPE    2000
WHEjyyDAADWCAPE    2000
WHEjyyDAADWDAPE    2000
WHEjyyDACDWAAPE    2002
WHEjyyDAEDWAAPE    2008
WHEjyyDACDWBBPE    2500
WHEjyyDAADLAAPE    5000
WHEjyyDAADLBAPE    5000
WHEjyyDAADLBBPE    5000
WHEjyyDAEDLAAPE    5004
WHEjjzDADLBBPE     8300
WHEjyyDAADTAAPE    10000
WHEjyyDABDTAAPE    10000
WHEjyyDADDTAAPE    10000
WHEjyyDADDTBBPE    10000
WHEjyyDAIDUAAPE    20000

D Genome: Size Estimation
$G_s = (K_n - K_s)/D = 4.2 \times 10^{9}$
$K_n = 59.8 \times 10^{9}$: total number of kmer words
$K_s = 4.3 \times 10^{9}$: number of single copy kmer words
$D = 13$: depth of kmer occurrence

Wheat D Genome Assembly
Solexa reads:
Number of read pairs: 805 Million
Estimated genome size: 4.2 Gb
Read length: 45-95bp
Estimated read coverage: ~40X
Insert size: bp
Mate pair data: 2k - 20k
Number of reads clustered: 558 Million
Assembly features (stats), Contigs:
Total number of contigs: 3,228,623
Total bases of contigs: 3.34 Gb
N50 contig size: 3,084
Largest contig: 86,064
Averaged contig size: 1,035
Contig coverage on genome: ~80%
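The N50 and averaged contig size reported on the assembly slides are standard summary statistics. The sketch below shows how they are commonly computed from a list of contig lengths; it uses the usual definition of N50 (the length of the contig at which the running total, taken from the longest contig down, first reaches half of all assembled bases) and is not taken from the Phusion2 code. The function names are illustrative.

```python
def n50(lengths):
    """N50 of a set of contig lengths.

    Contigs are visited from longest to shortest; the length of the
    contig at which the cumulative sum first reaches half of the
    total bases is returned. Returns 0 for an empty list.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def average_size(total_bases, n_contigs):
    """Averaged contig size: total bases divided by contig count."""
    if n_contigs <= 0:
        raise ValueError("n_contigs must be positive")
    return total_bases / n_contigs


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    # Wheat D genome: 3.34 Gb in 3,228,623 contigs, close to the
    # averaged contig size of 1,035 quoted on the slide.
    print(f"{average_size(3.34e9, 3228623):.1f}")
```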

[Table: assembly statistics for Grass carp (F&M), Miscanthus and Wild rice]

Acknowledgements:
- Joe Henson
- German Tischler
- Andrew Whitwham
- Chinese Academy of Agricultural Sciences: Jizeng Jia, Guangyue Zhao
- National Gene Research Centre, Chinese Academy of Sciences: Han Bin, Hengyun Lu