Published by Rudolph Nicholson. Modified over 9 years ago.
1
Large Plant Genome Assemblies using Phusion2
Zemin Ning
The Wellcome Trust Sanger Institute
2
Phusion2 Assembly Pipeline
NGS data: paired-end reads (170-800 bp), mate pair reads (2k-40k)
Filtering: Unikalow
Clustering: Phusion2
Contig generation (assembly): SOAPdenovo, Fermi, ABySS
Contig merge
Scaffolding: Spinner
Consensus bases: Smalt & Gap5
3
iCAS – an Illumina Clone Assembly System
ftp://ftp.sanger.ac.uk/pub/badger/aw7/icas_v061.tar.bz2
4
Data filtering using Unikalow
Unikalow: ftp://ftp.sanger.ac.uk/pub/zn1/unikalow/
5
Assembly Method
Sequencing reads:
1 ACCTGATC
2 CTGATCAA
3 TGATCAAT
4 AGCGATCA
5 CGATCAAT
6 GATCAATG
7 TCAATGTG
8 CAATGTGA
Graph representations:
1. Overlap graph
2. de Bruijn graph
3. String graph
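As a rough illustration of the first two graph types, the sketch below builds an overlap graph and a de Bruijn graph from the eight reads on this slide. The function names, the minimum overlap of 5 bases and the k-mer size of 4 are assumptions chosen for the example; they are not parameters taken from Phusion2 or any other assembler named in the talk.

```python
from collections import defaultdict

# The eight reads shown on the slide, in slide order (read 1 .. read 8).
READS = [
    "ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
    "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA",
]


def suffix_prefix_overlap(a, b, min_len):
    """Longest suffix of `a` equal to a prefix of `b`; 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=5):
    """Edges (i, j) -> overlap length, with reads numbered from 1 as on the slide."""
    edges = {}
    for i, a in enumerate(reads, start=1):
        for j, b in enumerate(reads, start=1):
            if i != j:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges


def de_bruijn_graph(reads, k=4):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    for (i, j), n in sorted(overlap_graph(READS).items()):
        print(f"read {i} -> read {j}: overlap {n}")
    for node, successors in sorted(de_bruijn_graph(READS).items()):
        print(node, "->", ", ".join(sorted(successors)))
```

The overlap graph compares whole reads pairwise, which is quadratic in the number of reads in this naive form, while the de Bruijn graph only needs one pass over the k-mers. A string graph is not shown here; it can be thought of as an overlap graph with redundant (transitive) edges removed.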
6
Scaffold Merge and Contig Merge (diagram; labels: Ref, Base, Sup, Ctg)
ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/
7
Contig Consensus using Gap5
9
Can we really trust Single Molecule Sequencing?
(Figure panels: PacBio, Capillary, Illumina)
10
Clone Assemblies vs Assemblers
5 BAC clones and 3 fosmids
Clone coverage: 99.7%; Base quality: Q39
Columns: Clone | Length | SOAP (N50*, Sub|Ind) | ABySS (N50*, Sub|Ind) | iCAS (N50*, Sub|Ind, Uncov)
bE217O41869455986311|101092350|21092350|2 (2)**12
bT237K121304621371757|32233868|4472058|4 (19)**626
bE352A131538753124741|23930108|151325928|14 (65)**23
bE367M1415428810508340|9314051|11073940|1 (20)**1487
bE378K2120785017304711|105424023|51873960|1 (10)**741
fSS328I242036420873|5126281|0420470|00
fSS404B1432829195430|3290983|1328320|00
fSY5K1041286413520|3412960|0412960|00
11
Spinner – a scaffolding tool
Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each pair of connected contigs.
ftp://ftp.sanger.ac.uk/pub/users/zn1/spinner/
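The idea of turning mate pair links into graph edges with gap estimates can be sketched as follows. This is a simplified illustration, not Spinner's actual algorithm: the function name, the input tuple layout, the `min_links` threshold and the use of the median are all assumptions, and read orientation is taken as already resolved (forward read on the first contig, reverse read on the second).

```python
from collections import defaultdict
from statistics import median


def build_scaffold_graph(pair_hits, contig_len, insert_size, min_links=2):
    """Return {(contig_a, contig_b): (link_count, estimated_gap)}.

    pair_hits   : iterable of (contig_a, pos_a, contig_b, end_b), where pos_a is
                  the start of the forward read on contig_a and end_b is the end
                  coordinate of the reverse read on contig_b (hypothetical layout).
    contig_len  : dict mapping contig name to its length in bases.
    insert_size : expected insert size of the mate pair library.
    """
    gaps = defaultdict(list)
    for contig_a, pos_a, contig_b, end_b in pair_hits:
        if contig_a == contig_b:
            continue  # both reads on one contig: no scaffolding information
        # insert = (tail of contig_a after pos_a) + gap + (head of contig_b up to end_b)
        gap = insert_size - (contig_len[contig_a] - pos_a) - end_b
        gaps[(contig_a, contig_b)].append(gap)
    # Keep only edges supported by enough pairs; the median damps outliers.
    return {
        edge: (len(values), median(values))
        for edge, values in gaps.items()
        if len(values) >= min_links
    }


# Toy data: three pairs link c1 to c2, one pair falls entirely inside c1.
contig_len = {"c1": 10000, "c2": 8000}
hits = [
    ("c1", 9000, "c2", 1500),
    ("c1", 9200, "c2", 1700),
    ("c1", 8800, "c2", 1300),
    ("c1", 100, "c1", 400),
]
graph = build_scaffold_graph(hits, contig_len, insert_size=3000)
print(graph)
```

A negative gap estimate in such a scheme would suggest that the two contigs overlap rather than being separated by unsequenced bases.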
12
Spinner – walks through a loop
These techniques alone produce useful results. Further stages will resolve repeats using pairs that “jump over” repeats and graph flow concepts.
13
Spinner vs SSPACE
_____________________________________________________________________
                                SSPACE               SPINNER
Genome           Genome_Size    N50       Average    N50       Average
_____________________________________________________________________
Assemblathon 1   119 Mb         608 Kb    86.8 Kb    11 Mb     450 Kb
Grass Carp (F)   900 Mb         2.3 Mb    14.4       5.85 Mb   17.1 Kb
Grass Carp (M)   1000 Mb        0.34 Mb   11.2 Kb    2.27 Mb   8.2 Kb
Bamboo           2.0 Gb         322 Kb    7404       488 Kb    7689
Parrot           1.23 Gb        906 Kb    4675       1.32 Mb   6969
_____________________________________________________________________
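For readers unfamiliar with the N50 and average figures compared here, the following minimal sketch computes both from a list of scaffold lengths. It uses the common definition of N50 (the length at which the longest sequences together reach at least half of the total bases); the talk does not state which definition its tools use, so treat that as an assumption. The function names are illustrative.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def average_length(lengths):
    """Mean sequence length, or 0 for an empty list."""
    return sum(lengths) / len(lengths) if lengths else 0


# Toy scaffold lengths (not taken from the assemblies in the table).
toy = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy), average_length(toy))
```

N50 is dominated by the longest sequences, while the average is pulled down by many short ones, which is why the two columns can move in different directions between tools.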
14
Grass Phylogeny
15
Bamboo Genome: Size Estimation
$G_s = (K_n - K_s)/D = 1.97 \times 10^{9}$
$K_n = 80.5 \times 10^{9}$: total number of k-mer words
$K_s = 9.5 \times 10^{9}$: number of single-copy k-mer words
$D = 36$: depth of k-mer occurrence
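The arithmetic behind this estimate can be reproduced in a few lines. The function name is illustrative. The slide does not say why the single-copy k-mers are subtracted; a common reason, assumed here, is that k-mers seen only once are mostly sequencing errors.

```python
def kmer_genome_size(total_kmers, single_copy_kmers, depth):
    """Genome size estimate G_s = (K_n - K_s) / D, as given on the slide."""
    return (total_kmers - single_copy_kmers) / depth


# Bamboo values from the slide: K_n = 80.5e9, K_s = 9.5e9, D = 36.
bamboo_size = kmer_genome_size(80.5e9, 9.5e9, 36)
print(f"{bamboo_size / 1e9:.2f} Gb")
```

The result agrees with the 1.97 x 10^9 quoted on the slide and with the finished bamboo genome size of about 2.0 Gb.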
16
Bamboo Genome Assembly
Solexa reads:
  Number of read pairs: 877 million
  Finished genome size: 2.0 Gb
  Read length: 2x100 bp
  Estimated read coverage: ~90X
  Insert size: 500/50-600 bp
  Mate pair data: 3k, 5k, 7k, 8k, 10k, 20k
  Number of reads clustered: 757 million
Assembly features (stats):
                               Contigs     Scaffolds
  Total number of contigs:     744,286     277,278
  Total bases of contigs:      1.86 Gb     2.05 Gb
  N50 contig size:             11,622      328,698
  Largest contig:              188,163     4,869,017
  Averaged contig size:        2,500       7,400
  Contig coverage on genome:   ~90%        >95%
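The estimated read coverage for bamboo can be checked from the other figures on this slide: total sequenced bases divided by genome size. The sketch below assumes every read is the full 100 bp and that both reads of each pair count; the function name is illustrative.

```python
def read_coverage(read_pairs, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size (two reads per pair)."""
    return read_pairs * 2 * read_length_bp / genome_size_bp


# Bamboo values from the slide: 877 million pairs, 2x100 bp, 2.0 Gb genome.
bamboo_cov = read_coverage(877e6, 100, 2.0e9)
print(f"{bamboo_cov:.1f}X")
```

This gives a depth just under 90X, in line with the ~90X stated on the slide.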
17
Bamboo Genome Assembly QC using Finished BACs
                                             Assemblies by      Assemblies by
                                             pure SOAPdenovo    SOAPdenovo & ABySS
Rate of single-base difference (# per Kb)    2.28               0.43
Rate of insertion and deletion (# per Kb)    0.82               0.19
Coverage by initial contigs                  0.76               0.85
Coverage by supercontigs                     0.91               0.94
20
Evolution of the Wheat Genome
21
Size of the Wheat Genome: 17Gb
22
International Wheat Genome Sequencing Consortium
23
Sequencing of D Genome: Libraries & Insert Sizes
Library            Insert size (bp)
WHEjyyDADDBAAPE    167
WHEjjzDADDCBAPE    199
WHEjjzDADDCCAPE    223
WHEjjzDADDCABPE    230
WHEjyyDAEDDAAPE    250
WHEjyyDAEDDABPE    250
WHEjyyDAEDDBAPE    250
WHEjyyDAEDDBBPE    250
WHEjyyDAEDDCAPE    250
WHEjyyDAEDDCBPE    250
WHEjyyDAEDDDAPE    250
WHEjjzDADDCACPE    254
WHEjyyDAEDIAAPE    500
WHEjyyDAEDIBAPE    500
WHEjyyDADDIAAPE    502
WHEjyyDADDIDAPE    510
WHEjyyDADDICAPE    527
WHEjyyDADDIBAPE    532
WHEjyyDADDIBBPE    551
WHEjyyDADDKAAPE    682
WHEjyyDADDMBAPE    706
WHEjyyDADDKCAPE    725
WHEjyyDADDMAAPE    764
WHEjyyDAADWAAPE    2000
WHEjyyDAADWBAPE    2000
WHEjyyDAADWCAPE    2000
WHEjyyDAADWDAPE    2000
WHEjyyDACDWAAPE    2002
WHEjyyDAEDWAAPE    2008
WHEjyyDACDWBBPE    2500
WHEjyyDAADLAAPE    5000
WHEjyyDAADLBAPE    5000
WHEjyyDAADLBBPE    5000
WHEjyyDAEDLAAPE    5004
WHEjjzDADLBBPE     8300
WHEjyyDAADTAAPE    10000
WHEjyyDABDTAAPE    10000
WHEjyyDADDTAAPE    10000
WHEjyyDADDTBBPE    10000
WHEjyyDAIDUAAPE    20000
24
D Genome: Size Estimation
$G_s = (K_n - K_s)/D = 4.2 \times 10^{9}$
$K_n = 59.8 \times 10^{9}$: total number of k-mer words
$K_s = 4.3 \times 10^{9}$: number of single-copy k-mer words
$D = 13$: depth of k-mer occurrence
25
Wheat D Genome Assembly
Solexa reads:
  Number of read pairs: 805 million
  Estimated genome size: 4.2 Gb
  Read length: 45-95 bp
  Estimated read coverage: ~40X
  Insert size: 167-800 bp
  Mate pair data: 2k-20k
  Number of reads clustered: 558 million
Assembly features (stats):
                               Contigs
  Total number of contigs:     3,228,623
  Total bases of contigs:      3.34 Gb
  N50 contig size:             3,084
  Largest contig:              86,064
  Averaged contig size:        1,035
  Contig coverage on genome:   ~80%
26
Grass carp (F & M), Miscanthus, Wild rice
55,277     130,221
0.88 Gb    0.97 Gb
40,353     18,252
5.89 Mb    2.27 Mb
27
Acknowledgements:
Joe Henson, German Tischler, Andrew Whitwham
Chinese Academy of Agricultural Sciences: Jizeng Jia, Guangyue Zhao
National Gene Research Centre, Chinese Academy of Sciences: Han Bin, Hengyun Lu