Cross_genome: Assembly Scaffolding using Cross-species Synteny
Zemin Ning
High Performance Assembly
Can synteny help? And how?
  Scaffolding
  Contig gap closure
RACA - Reference-assisted chromosome assembly
Lattice of Target - Reference
Q = scaff(i) * 2^32 + contig_loci(j)
[Figure: target sequence plotted against the reference, with Scaffold 1, Scaffold 2 and Scaffold 3 marked]
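The formula for Q can be illustrated with a small sketch, reading the multiplier as 2^32 so that a scaffold index and a contig locus share one integer key. The function names `pack_q` and `unpack_q` and the sample hits are hypothetical and are not taken from Cross_genome itself.

```python
# Minimal sketch, assuming Q packs a scaffold index and a contig locus
# into a single integer with a 32-bit shift (multiplier 2**32).
SHIFT = 32
LOCUS_MASK = (1 << SHIFT) - 1


def pack_q(scaff_index: int, contig_locus: int) -> int:
    """Combine scaffold index and locus into one sortable key."""
    if scaff_index < 0:
        raise ValueError("scaffold index must be non-negative")
    if not 0 <= contig_locus <= LOCUS_MASK:
        raise ValueError("contig locus does not fit in 32 bits")
    return (scaff_index << SHIFT) + contig_locus


def unpack_q(q: int) -> tuple[int, int]:
    """Recover (scaffold index, contig locus) from a packed key."""
    return q >> SHIFT, q & LOCUS_MASK


# Hypothetical hits as (scaffold index, locus) pairs, in arbitrary order.
hits = [(2, 500), (1, 9000), (1, 120), (3, 40)]

# Sorting the packed keys orders hits by scaffold first, then by locus.
ordered = [unpack_q(q) for q in sorted(pack_q(s, p) for s, p in hits)]
print(ordered)
```

Sorting one integer per hit is cheaper than sorting pairs, which is a plausible reason for this encoding, although the slide does not state the motivation.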
After Noise Cleaning
Gap_size = Y - X
[Figure: target sequence against the reference after noise cleaning, showing Scaffold 1, Scaffold 2 and Scaffold 3, with X and Y marked near Scaffold 2 and Scaffold 3]
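As a worked example of Gap_size = Y - X, the sketch below assumes X is the reference coordinate where one scaffold's alignment ends and Y is the reference coordinate where the next scaffold's alignment begins. That reading of the figure, the function name `gap_size` and the coordinates are all assumptions.

```python
def gap_size(x_ref_end: int, y_ref_start: int) -> int:
    """Gap between two neighbouring scaffolds, measured on the reference.

    x_ref_end:   reference position where the first scaffold's alignment ends
    y_ref_start: reference position where the second scaffold's alignment begins
    A negative result would mean the two alignments overlap on the reference.
    """
    return y_ref_start - x_ref_end


# Hypothetical coordinates, for illustration only.
estimated_gap = gap_size(1_250_000, 1_262_400)
print(estimated_gap)
```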
Cases That Shouldn't Join
[Figure: reference against target layouts of Scaffold 1 and Scaffold 2, with Gap_size marked]
GAGE: Human Chr14 and RACA using Orangutan
Assembler N_bases N_scaffs N50 (Mb) Original 88.8 418 81.6 ALLPATHS-LG RACA 86.8 Cross_genome 89 221 85.5 78.6 1472 0.37 Bambus2 72.1 1094 13.7 86.5 498 0.4 CABOG 81.4 86.3 46 89.7 0.88 MSR-CA 83.4 89.6 94.7 30975 0.075 SGA 57.4 94.8 29662 77.3 108 38477 0.453 SOAPdenovo 84.4 102.8 12955 78.9 143.8 61455 0.84 Velvet 123 139.4 3278 8.71
Scaffold N50 for Other Genome Assemblies
                  Original  Cross_g  References
Panda             1.3Mb     25Mb     Dog, Human
Tibetan Antelope  2.6Mb     42Mb     Cattle, Dog, Human
Tasmanian Devil   1.8Mb     6.8Mb    Opossum
Availability: ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/cross_genome/
Improve gorilla assembly using human reference: contig gap size re-estimation
Pipeline: Combined Gorilla-Human Assembly -> Read Alignment -> Pair-wise/Multiple Read Clustering -> Local Assembly -> Final Gorilla Assembly
Re-estimate Contig Gap Sizes from Reference
[Figure: target sequence against reference sequence, showing the original gap size, the new gap size, the inserted reference sequence, and local assembly based on clustered reads]
Assemblies using Synteny-guided Method
Reads: 2x100 with 500bp insert, 60X
                                        Human Chr6 - Simulation         Gorilla Genome - Real Data
Original assembly contig N50            24.3kb                          13.5kb
Average contig length                   6850bp                          6940bp
N of clusters (100000 pairs)            504                             5807
New assembly contig N50                 43.7kb                          24.0kb
Gap closed                              7809                            10433
N of base errors in gap closed regions  256 subs and 12 indels (24bps)  N/A
Gorilla - Merge with other De novo Assemblies
                  Original assembly (dev5)  Merge with Fermi*  Merge with Masurca+
Contig N50        13.5kb                    30.2kb             53.1kb
Average length    6850                      12577              18768
Largest contig    215kb                     391.2kb            448.8kb
N of gaps closed                            182661             257167
*Fermi assembler: https://github.com/lh3/fermi/
+Masurca assembler: http://www.genome.umd.edu/masurca.html
Gs = (Kn - Ks)/D = 4.5 x 10^9, where
  Kn = 125.4 x 10^9: total number of k-mer words
  Ks = 2.4 x 10^9: number of single-copy k-mer words
  D = 27: depth of k-mer occurrence
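The arithmetic behind this estimate can be checked directly. The sketch below reads Gs as the estimated genome size (an assumption, since the slide does not expand the symbol) and uses a hypothetical function name, `genome_size`; the three input values are the ones given on the slide.

```python
def genome_size(kn: float, ks: float, depth: float) -> float:
    """Estimate Gs = (Kn - Ks) / D from k-mer counts.

    kn:    total number of k-mer words
    ks:    number of single-copy k-mer words
    depth: depth of k-mer occurrence (D)
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return (kn - ks) / depth


# Values from the slide: Kn = 125.4e9, Ks = 2.4e9, D = 27.
gs = genome_size(125.4e9, 2.4e9, 27)
print(f"Gs = {gs:.3e}")
```

With these inputs the result is about 4.56 x 10^9, which the slide reports as 4.5 x 10^9.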
Original Contig (query) against New Assembly after Contig Break
Alignment Inconsistency
The Gorilla Assemblies
                          Original  New
Total number of contigs:  464,875   285,139
N50 contig size:          11.7kb    23.9kb
Largest contig:           191,556   322,733
Averaged contig size:     6085      9928
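The statistics in this table (contig count, N50, largest and average contig size) can be computed from a list of contig lengths. The sketch below uses the common definition of N50: the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size. The function names and the toy lengths are hypothetical, and the definition is assumed rather than stated on the slides.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a non-empty list")


def assembly_stats(lengths: list[int]) -> dict[str, float]:
    """Summarise an assembly with the same fields as the table above."""
    return {
        "contigs": len(lengths),
        "n50": n50(lengths),
        "largest": max(lengths),
        "average": sum(lengths) / len(lengths),
    }


# Toy contig lengths, for illustration only.
stats = assembly_stats([100, 80, 60, 40, 20])
print(stats)
```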
Acknowledgements:
  Hannes Ponstingl
  Frank Liu - Nanjing University of Information Technology (NUIT)
  Yan Li - (NUIT)
  Gorilla genome sequencing data
  BGI - Panda and Tibetan Antelope assemblies