Large Plant Genome Assemblies using Phusion2 Zemin Ning The Wellcome Trust Sanger Institute
Phusion2 Assembly Pipeline. Input: NGS data (Pair End Reads bp; Mate Pair Reads 2k-40k). Components: Data Filtering (Unikalow); Clustering (Phusion2); Assembly and Contig Generation (SOAPdenovo, Fermi, ABySS); Contig Merge; Scaffolding (Spinner); Consensus Bases (Smalt & Gap5).
ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2 iCAS – an Illumina Clone Assembly System
Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/ Data filtering using Unikalow
Assembly Method
Sequencing reads: 1 ACCTGATC; 2 CTGATCAA; 3 TGATCAAT; 4 AGCGATCA; 5 CGATCAAT; 6 GATCAATG; 7 TCAATGTG; 8 CAATGTGA
Approaches: 1. Overlap graph; 2. de Bruijn graph; 3. String graph
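As an illustration of the second approach, the sketch below builds a small de Bruijn graph from the eight example reads. It is only a minimal sketch, not the construction used by any assembler named in this talk: the function name `de_bruijn_edges` and the choice k = 4 are assumptions made for the example.

```python
from collections import defaultdict

# The eight example reads from the slide.
READS = [
    "ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
    "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA",
]

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    (Hypothetical helper for illustration; k is an assumed parameter.)
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

graph = de_bruijn_edges(READS, k=4)

# Print the adjacency list in a stable order.
for node in sorted(graph):
    print(node, "->", ", ".join(sorted(graph[node])))
```

Reads that share a k-mer collapse onto the same edge, so overlaps between reads never have to be computed pairwise; that is the main contrast with the overlap graph approach.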
Scaffold Merge and Contig Merge (figure labels: Ref, Base, Sup, Ctg): ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/
Contig Consensus using Gap5
PacBio Capillary Illumina Can we really trust Single Molecule Sequencing?
Clone Assemblies vs Assemblers
5 BAC clones and 3 fosmids. Clone coverage: 99.7%; Base quality: Q39
Columns: Clone | Length | SOAP (N50*, Sub|Ind) | ABySS (N50*, Sub|Ind) | iCAS (N50*, Sub|Ind, Uncov)
bE217O | | | 2 (2)**12
bT237K | | | 4 (19)**626
bE352A | | | 14 (65)**23
bE367M | | | 1 (20)**1487
bE378K | | | 1 (10)**741
fSS328I | | | 00
fSS404B | | | 00
fSY5K | | | 00
Spinner – a scaffolding tool. Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each connected pair of contigs. ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/
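The gap estimate from the expected insert size can be sketched as follows. This is not Spinner's code: the function names, the coordinate convention (one read on the left contig pointing right, its mate on the right contig pointing left) and the use of a median over all linking pairs are assumptions made for illustration.

```python
from statistics import median

def gap_from_pair(insert_size, len_a, start_on_a, end_on_b):
    """Gap implied by one mate pair linking contig A (left) to contig B (right).

    start_on_a: 0-based position on A where the insert begins.
    end_on_b:   number of bases of B covered by the insert, i.e. the
                position on B where the insert ends.
    (Hypothetical convention, assumed for this sketch.)
    """
    bases_on_a = len_a - start_on_a   # part of the insert lying on contig A
    bases_on_b = end_on_b             # part of the insert lying on contig B
    return insert_size - bases_on_a - bases_on_b

def gap_estimate(insert_size, len_a, pairs):
    """Combine all linking pairs; the median damps insert-size variation."""
    return median([gap_from_pair(insert_size, len_a, s, e) for s, e in pairs])

# Three pairs from an assumed 3 kb library linking a 10 kb contig to its neighbour.
links = [(9000, 1500), (9200, 1600), (8900, 1500)]
gap = gap_estimate(3000, 10000, links)
print(gap)  # 500
```

A negative result would suggest that the two contigs overlap rather than being separated by a gap.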
Spinner – walks through a loop. These techniques alone produce useful results. Further stages will be used to resolve repeats, using pairs that “jump over” repeats and graph flow concepts.
Spinner vs SSPACE
Columns: Genome | Genome_Size | SSPACE N50 | SSPACE Average | SPINNER N50 | SPINNER Average
Assemblathon: Mb, 608Kb, 86.8Kb, 11Mb, 450Kb
Grass Carp (F): 900Mb, 2.3Mb, Mb, 17.1Kb
Grass Carp (M): 1000Mb, 0.34Mb, 11.2Kb, 2.27Mb, 8.2Kb
Bamboo: 2.0 Gb, 322Kb, Kb, 7689
Parrot: 1.23 Gb, 906Kb, Mb, 6969
Grass Phylogeny
Bamboo Genome: Size Estimation
$G_s = (K_n - K_s)/D = 1.97 \times 10^{9}$
$K_n = 80.5 \times 10^{9}$: total number of kmer words
$K_s = 9.5 \times 10^{9}$: number of single copy kmer words
$D = 36$: depth of kmer occurrence
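The size estimate can be checked with a few lines of arithmetic. This is a minimal sketch: the function name is hypothetical, and the single copy kmer count is taken as 9.5 x 10^9, an assumption based on the stated result of 1.97 x 10^9.

```python
def estimate_genome_size(total_kmers, single_copy_kmers, depth):
    """Genome size G_s = (K_n - K_s) / D.

    total_kmers:       K_n, total number of kmer words in the reads
    single_copy_kmers: K_s, number of single copy kmer words
    depth:             D, depth of kmer occurrence
    """
    return (total_kmers - single_copy_kmers) / depth

# Bamboo values from the slide; the 10^9 scale of K_s is an assumption.
bamboo_size = estimate_genome_size(80.5e9, 9.5e9, 36)
print(f"{bamboo_size / 1e9:.2f} Gb")  # 1.97 Gb
```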
Bamboo Genome Assembly
Solexa reads:
Number of read pairs: 877 Million
Finished genome size: 2.0 Gb
Read length: 2x100bp
Estimated read coverage: ~90X
Insert size: 500/ bp
Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
Number of reads clustered: 757 Million
Assembly features (stats): Contigs | Scaffolds
Total number of contigs: 744, | ,278
Total bases of contigs: 1.86 Gb | 2.05 Gb
N50 contig size: 11,622 | 328,698
Largest contig: 188,163 | 4,869,017
Averaged contig size: 2,500 | 7,400
Contig coverage on genome: ~90% | >95%
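Two of the quantities in these statistics are simple to compute, and the sketch below shows one way to do so. The function names and the toy contig lengths are assumptions for illustration; only the read pair count, read length and genome size come from the slide.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Average depth: total sequenced bases (two reads per pair) over genome size."""
    return 2 * read_pairs * read_length / genome_size

def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# 877 million pairs of 2x100 bp reads on a 2.0 Gb genome.
coverage = read_coverage(877e6, 100, 2.0e9)
print(f"{coverage:.1f}X")  # 87.7X, in line with the ~90X on the slide

# Toy contig lengths (made up for the example).
print(n50([2, 3, 4, 5, 6]))  # 5
```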
Bamboo Genome Assembly QC using Finished BACs. Compared: assemblies by pure SOAPdenovo; assemblies by SOAPdenovo & ABySS. Measures: rate of single-base difference (# per Kb); rate of insertion and deletion (# per Kb); coverage by initial contigs; coverage by supercontigs.
Evolution of the Wheat Genome
Size of the Wheat Genome: 17Gb
International Wheat Genome Sequencing Consortium
Sequencing of D Genome: Libraries & Insert Sizes
Library, insert size (bp):
WHEjyyDADDBAAPE 167
WHEjjzDADDCBAPE 199
WHEjjzDADDCCAPE 223
WHEjjzDADDCABPE 230
WHEjyyDAEDDAAPE 250
WHEjyyDAEDDABPE 250
WHEjyyDAEDDBAPE 250
WHEjyyDAEDDBBPE 250
WHEjyyDAEDDCAPE 250
WHEjyyDAEDDCBPE 250
WHEjyyDAEDDDAPE 250
WHEjjzDADDCACPE 254
WHEjyyDAEDIAAPE 500
WHEjyyDAEDIBAPE 500
WHEjyyDADDIAAPE 502
WHEjyyDADDIDAPE 510
WHEjyyDADDICAPE 527
WHEjyyDADDIBAPE 532
WHEjyyDADDIBBPE 551
WHEjyyDADDKAAPE 682
WHEjyyDADDMBAPE 706
WHEjyyDADDKCAPE 725
WHEjyyDADDMAAPE 764
WHEjyyDAADWAAPE 2000
WHEjyyDAADWBAPE 2000
WHEjyyDAADWCAPE 2000
WHEjyyDAADWDAPE 2000
WHEjyyDACDWAAPE 2002
WHEjyyDAEDWAAPE 2008
WHEjyyDACDWBBPE 2500
WHEjyyDAADLAAPE 5000
WHEjyyDAADLBAPE 5000
WHEjyyDAADLBBPE 5000
WHEjyyDAEDLAAPE 5004
WHEjjzDADLBBPE 8300
WHEjyyDAADTAAPE 10000
WHEjyyDABDTAAPE 10000
WHEjyyDADDTAAPE 10000
WHEjyyDADDTBBPE 10000
WHEjyyDAIDUAAPE 20000
D Genome: Size Estimation
$G_s = (K_n - K_s)/D = 4.2 \times 10^{9}$
$K_n = 59.8 \times 10^{9}$: total number of kmer words
$K_s = 4.3 \times 10^{9}$: number of single copy kmer words
$D = 13$: depth of kmer occurrence
Wheat D Genome Assembly
Solexa reads:
Number of read pairs: 805 Million
Estimated genome size: 4.2 Gb
Read length: 45-95bp
Estimated read coverage: ~40X
Insert size: bp
Mate pair data: 2k - 20k
Number of reads clustered: 558 Million
Assembly features (stats): Contigs
Total number of contigs: 3,228,623
Total bases of contigs: 3.34 Gb
N50 contig size: 3,084
Largest contig: 86,064
Averaged contig size: 1,035
Contig coverage on genome: ~80%
Other assemblies: Grass carp (F&M), Miscanthus, Wild rice. Values as extracted: 55,277130, Gb0.97Gb 40,35318, Mb2.27Mb
Acknowledgements:
Joe Henson, German Tischler, Andrew Whitwham
Chinese Academy of Agricultural Sciences: Jizeng Jia, Guangyue Zhao
National Gene Research Centre, Chinese Academy of Sciences: Han Bin, Hengyun Lu