FuzzyPath Assemblies - from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes
Zemin Ning
The Wellcome Trust Sanger Institute
Assembly Strategy
[Diagram: Genome/Chromosome -> forward-reverse paired reads (known dist ~500 bp) -> Solexa reads assembler to extend long reads of 1-2 Kb -> capillary reads assembler (Phrap/Phusion)]
Kmer Extension & Repeat Junctions
Handling of Single Base Variations
Fuzzy Kmers
Number of Mismatches between Two Kmers

ACGTAACTAACAGTT
ACGTAACTCACAGTT
ACGTAACT ACAGTT
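The mismatch count between two kmers shown on this slide can be sketched as a Hamming distance. This is a minimal illustration, not FuzzyPath's own code: the function names `kmer_mismatches` and `is_fuzzy_match` and the `max_mismatches` threshold are assumptions made for the example.

```python
def kmer_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_fuzzy_match(a: str, b: str, max_mismatches: int = 1) -> bool:
    """Treat two kmers as the same 'fuzzy' kmer if they differ in at most
    max_mismatches bases (the threshold is a hypothetical parameter)."""
    return kmer_mismatches(a, b) <= max_mismatches


# The two kmers from the slide differ only at the ninth base (A vs C).
print(kmer_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))
```

With a threshold of one mismatch, the two kmers on the slide would be merged, which is one way a single base variation can be tolerated during extension.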
Kmer Extension & Repeat Junctions
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference or Sanger reads
[Figure: pileup of other reads (454, Sanger, etc.) at a repeat junction, with the consensus]
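Kmer extension and the notion of a repeat junction can be illustrated with a small sketch. It assumes exact kmers and error-free reads, and it simply stops at a junction instead of resolving it with base quality, read pairs, fuzzy kmers or a reference as listed above. The helper names `build_successors` and `extend` are hypothetical and do not come from FuzzyPath.

```python
from collections import defaultdict


def build_successors(reads, k):
    """Map each kmer to the set of bases that follow it in any read."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            succ[read[i:i + k]].add(read[i + k])
    return succ


def extend(seed, succ, k, max_len=10_000):
    """Extend seed one base at a time while the next base is unambiguous.

    Returns (contig, at_junction). at_junction is True when extension
    stopped because the last kmer has more than one possible next base.
    Assumes len(seed) >= k.
    """
    contig = seed
    seen = {seed[-k:]}
    at_junction = False
    while len(contig) < max_len:
        nxt = succ.get(contig[-k:], set())
        if len(nxt) != 1:
            at_junction = len(nxt) > 1  # zero successors is a dead end
            break
        contig += next(iter(nxt))
        kmer = contig[-k:]
        if kmer in seen:  # guard against cycling through a repeat
            break
        seen.add(kmer)
    return contig, at_junction


# Toy reads (made up for illustration): two reads diverge after "ACCA".
reads = ["ACGTACCA", "GTACCATT", "GTACCAGG"]
contig, at_junction = extend("ACGT", build_successors(reads, 4), 4)
print(contig, at_junction)
```

In this toy case the extension halts at the kmer ACCA, which is followed by T in one read and G in another; that is the point where extra evidence would be needed to choose a path.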
Pileup of Solexa and 454 Reads
S. suis P1/7 Solexa/454 Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~55X
454 reads:
- Number of 454 reads: 100,000
- Read coverage of 454: 10X
Assembly features - contig stats:
- Total number of contigs: 73
- Total bases of contigs: 1,999,817 bp
- N50 contig size: 62,508
- Largest contig: 162,190
- Average contig size: 27,394
- Contig coverage over the genome: ~99%
- Contig extension errors: 2
- Mis-assembly errors: 3
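The estimated read coverage on this slide follows from the usual relation coverage = reads x read length / genome size. The sketch below is only a check of that arithmetic; the function name `read_coverage` is an assumption, and so is counting every read at 36 bp, since the slide does not say how many reads were 39 bp.

```python
def read_coverage(num_reads: int, read_length: float, genome_size: int) -> float:
    """Expected depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# S. suis P1/7 figures from the slide, assuming all reads are 36 bp long.
cov = read_coverage(3_084_185, 36, 2_007_491)
print(f"{cov:.1f}X")  # about 55X, in line with the slide
```

Counting every read at 39 bp instead gives roughly 60X, so the quoted ~55X sits at the lower end of that range.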
Salmonella Senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
- Number of reads: 6,000,000
- Finished genome size: ~4.8 Mbp
- Read length: 2x37 bp
- Estimated read coverage: ~92.5X
- Insert size: 170/ bp
Assembly features - contig stats:
                                   Solexa       454
  Total number of contigs:         75           390
  Total bases of contigs:          4.80 Mbp     4.77 Mb
  N50 contig size:                 139,353      25,702
  Largest contig:                  395,600      62,040
  Average contig size:             63,969       12,224
  Contig coverage on genome:       ~99.8%       99.4%
  Contig extension errors:         0
  Mis-assembly errors:             0            4
E. coli strain 042 Assembly
Solexa reads:
- Number of reads: 7,055,348
- Finished genome size: 5.35 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~95X
- Insert size: 170/ bp
Assembly features - contig stats:
- Total number of contigs: 168
- Total bases of contigs: 5.19 Mbp
- N50 contig size: 85,886
- Largest contig: 337,768
- Average contig size: 30,886
- Contig coverage over the genome: ~99%
- Contig extension errors: 1
- Mis-assembly errors: 2
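The N50 contig size quoted in these contig stats is conventionally the length of the contig at which the running total of contig lengths, sorted from largest to smallest, first reaches half of the total assembled bases. A minimal sketch of that standard definition follows; the function name `n50` and the toy contig lengths are made up for illustration and are not taken from any of the assemblies above.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembled bases
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy example: total is 290, so the half-way point is 145.
# 80 + 70 = 150 reaches it, so the N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

Unlike the average contig size, N50 is weighted towards long contigs, which is why the two figures on the slides differ so much.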
Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
- Number of reads: 6,346,317
- Finished genome size: 4.7 Mbp
- Read length: 33 bp
- Estimated read coverage: ~40X
- Shredded reference of SpA: 10X
Assembly features - contig stats:
- Total number of contigs: 66
- Total bases of contigs: 4,615,704 bp
- N50 contig size: 168,793
- Largest contig: 401,700
- Average contig size: 69,934
- Contig coverage over the genome: ~98%
- Contig extension errors: 0
- Mis-assembly errors: 2
The Malaria Genome Project
Datasets with Various GC Content
  library        organism              read length   Mb sequence generated   genome size (Mb)   mean coverage
  PCR-free       B. pertussis ST24     2 x
  PCR-free       E. coli 042           2 x
  PCR-free       P. falciparum 3D7     2 x
  PCR-free       B. pertussis ST24     2 x
  PCR-free       P. falciparum 3D7     2 x
  PCR-free       E. coli 042           2 x
  standard-245   P. falciparum 3D7     2 x
  standard-368   P. falciparum 3D7     2 x
  standard-851   P. falciparum 3D7     2 x
  standard-883   P. falciparum clin    2 x
GC: 68.0%, 50.5%, 19.0%, 50.8%, 19.0%, 68.0%, 19.0%
Malaria 3D7 Assemblies
  Solexa reads:                    2x36 bp      2x76 bp
  Number of reads:                 14.0m        9.77m
  Finished genome size:            23 Mbp       23 Mbp
  Estimated read coverage:         43x          64x
  Insert size:                     170 bp       170 bp
Assembly features:
  Total number of contigs:         26,
  Total bases of contigs:          19.2 Mbp     21.1 Mb
  N50 contig size:
  Largest contig:
  Average contig size:
  Contig coverage on genome:       ~83.5%       91.7%
  Contig extension errors:         ??
  Mis-assembly errors:             ??
Acknowledgements:
Yong Gu, Ben Blackburne, Hannes Ponstingl, Daniel Turner, Michael Quail, Tony Cox, Richard Durbin