Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities
Zemin Ning
The Wellcome Trust Sanger Institute

Outline of the Talk:
  Euler Path and Sequence Reconstruction
  Euler Hash Table
  Read Extension Using Base Qualities and Read Pairs
  Repeat Junctions and Single Base Variation
  Assembly Results
  Future Work
Sequence Reconstruction - Hamiltonian path approach
  S=(ATGCAGGTCC)
  Vertices: k-tuples from the spectrum (8): ATG, TGC, GCA, CAG, AGG, GGT, GTC, TCC
  Edges: overlapping k-tuples (7)
  Path: visiting all vertices, corresponding to the sequence:
  ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
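The Hamiltonian path formulation can be tried out on the slide's example. The following is a minimal Python sketch, not the speaker's implementation: it assumes the k-tuples are distinct and uses a plain backtracking search, which is only practical for tiny inputs. The names spectrum and hamiltonian_reconstruct are illustrative.

```python
def spectrum(seq, k):
    """Return the k-tuples of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamiltonian_reconstruct(kmers):
    """Search for an ordering that visits every k-tuple exactly once.

    Two k-tuples are joined by an edge when the last k-1 bases of the
    first equal the first k-1 bases of the second.
    Assumption: all k-tuples in `kmers` are distinct.
    """
    def extend(path, remaining):
        if not remaining:
            return path
        for nxt in sorted(remaining):
            if path[-1][1:] == nxt[:-1]:
                result = extend(path + [nxt], remaining - {nxt})
                if result:
                    return result
        return None

    for start in sorted(kmers):
        path = extend([start], set(kmers) - {start})
        if path:
            # Spell the sequence: first k-tuple plus the last base of each later one.
            return path[0] + "".join(t[-1] for t in path[1:])
    return None
```

For S=(ATGCAGGTCC) and k = 3 every (k-1)-base overlap is unique, so the search has only one path to find. On real data the number of candidate orderings grows too fast for this approach, which is the usual motivation for the Euler path formulation.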
Sequence Reconstruction - Euler path approach
  Vertices: correspond to (k-1)-tuples (7): AT, TG, GG, GC, CG, GT, CA
  Edges: correspond to k-tuples from the spectrum (8)
  Path: visiting all EDGES, corresponding to the sequence.
  Two sequences share this spectrum: ATGCGTGGCA and ATGGCGTGCA
  ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA spells ATGGCGTGCA
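The Euler path formulation can be sketched in a few lines of Python. This is an illustrative version using Hierholzer's algorithm, chosen here as a standard way to find an Euler path; the slide does not say which algorithm is used, and euler_reconstruct is a hypothetical name.

```python
from collections import defaultdict


def euler_reconstruct(kmers):
    """Reconstruct a sequence whose k-tuple spectrum is `kmers`.

    Vertices are (k-1)-tuples, each k-tuple is one directed edge from its
    prefix to its suffix, and the sequence is spelled by a path that uses
    every edge once (Hierholzer's algorithm).
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indegree[kmer[1:]] += 1

    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next(
        (v for v in list(graph) if len(graph[v]) - indegree[v] == 1),
        kmers[0][:-1],
    )

    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        return None  # not every edge could be used in a single path
    return path[0] + "".join(v[-1] for v in path[1:])
```

Called with the eight k-tuples of the slide, the function returns one of the two sequences that share the spectrum; which one depends on the order in which edges are popped. That ambiguity is what read pairs and other extra information have to resolve.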
SSAHA Type Hash Table
  S1=(ATGCAGGTCC), S2=(ATCAGGTCC)
  S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)
  Each entry is (sequence index, offset, link to the next k-tuple); a link of -1 marks the last k-tuple of a sequence.

  E    k-tuple   S1       S2       S3       S4
  7    ATG       1,1,28   -        3,1,28   4,1,28
  8    ATC       -        2,1,29   -        -
  10   AGT       -        -        -        4,5,38
  11   AGG       1,5,42   2,4,42   3,6,42   -
  19   TAG       -        -        3,5,11   -
  24   TTC       -        -        -        4,7,32
  28   TGC       1,2,45   -        3,2,46   4,2,45
  29   TCA       -        2,2,51   -        -
  32   TCC       1,8,-1   2,7,-1   3,9,-1   4,8,-1
  38   GTT       -        -        -        4,6,24
  40   GTC       1,7,32   2,6,32   3,8,32   -
  42   GGT       1,6,40   2,5,40   3,7,40   -
  45   GCA       1,3,51   -        -        4,3,51
  46   GCT       -        -        3,3,53   -
  51   CAG       1,4,11   2,3,11   -        4,4,10
  53   CTA       -        -        3,4,19   -
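The table can be rebuilt from the four sequences. The following is a minimal Python sketch under stated assumptions: k = 3, 1-based offsets, and the encoding E = 1 + (base-4 value of the k-tuple) with A=0, T=1, G=2, C=3. That encoding is inferred from the E values shown on the slide, not stated there, and the names encode and build_table are illustrative.

```python
# Assumed base codes, inferred from the E values on the slide.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(kmer):
    """Map a k-tuple to its table index E (base-4 value plus one)."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value + 1


def build_table(reads, k):
    """Return {E: [(read index, offset, link), ...]} for all k-tuples.

    The link is the E value of the k-tuple that follows in the same read,
    or -1 for the last k-tuple of the read.
    """
    table = {}
    for index, read in enumerate(reads, start=1):
        n_kmers = len(read) - k + 1
        for offset in range(1, n_kmers + 1):
            kmer = read[offset - 1:offset - 1 + k]
            if offset < n_kmers:
                link = encode(read[offset:offset + k])
            else:
                link = -1
            table.setdefault(encode(kmer), []).append((index, offset, link))
    return table
```

Calling build_table(["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"], 3) gives, for example, three entries under E = 7 (ATG), one each for S1, S3 and S4, all linking to 28 (TGC).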
Point to the Next - Hash Table Links
  S1=(ATGCAGGTCC), S2=(ATCAGGTCC)
  S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)
  The third field of each entry points to the table row of the k-tuple that follows in the same sequence.

  E    k-tuple   S1       S2       S3       S4
  7    ATG       1,1,28   -        3,1,28   4,1,28
  8    ATC       -        2,1,29   -        -
  10   AGT       -        -        -        4,5,38
  11   AGG       1,5,42   2,4,42   3,6,42   -
  19   TAG       -        -        3,5,11   -
  24   TTC       -        -        -        4,7,32
  28   TGC       1,2,45   -        3,2,46   4,2,45
  29   TCA       -        2,2,51   -        -
  32   TCC       1,8,-1   2,7,-1   3,9,-1   4,8,-1
  38   GTT       -        -        -        4,6,24
  40   GTC       1,7,32   2,6,32   3,8,32   -
  42   GGT       1,6,40   2,5,40   3,7,40   -
  45   GCA       1,3,51   -        -        4,3,51
  46   GCT       -        -        3,3,53   -
  51   CAG       1,4,11   2,3,11   -        4,4,10
  53   CTA       -        -        3,4,19   -
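Following the links gives a simple form of k-mer extension. The Python sketch below is an illustration, not the method from the talk: it extends while all reads agree on the next k-tuple and stops at the first junction, whereas the talk resolves junctions with base qualities and read pairs. It assumes k = 3 and the encoding A=0, T=1, G=2, C=3 inferred from the table; all function names are hypothetical.

```python
BASES = "ATGC"  # assumed codes A=0, T=1, G=2, C=3, inferred from the table


def encode(kmer):
    """Map a k-tuple to its table index E (base-4 value plus one)."""
    value = 0
    for base in kmer:
        value = value * 4 + BASES.index(base)
    return value + 1


def decode(e, k):
    """Inverse of encode: recover the k-tuple from its table index."""
    value, out = e - 1, []
    for _ in range(k):
        value, digit = divmod(value, 4)
        out.append(BASES[digit])
    return "".join(reversed(out))


def build_table(reads, k):
    """Return {E: [(read index, offset, link), ...]} with 1-based offsets."""
    table = {}
    for index, read in enumerate(reads, start=1):
        n_kmers = len(read) - k + 1
        for offset in range(1, n_kmers + 1):
            link = encode(read[offset:offset + k]) if offset < n_kmers else -1
            key = encode(read[offset - 1:offset - 1 + k])
            table.setdefault(key, []).append((index, offset, link))
    return table


def walk(table, seed, k):
    """Extend from `seed` while every read agrees on the next k-tuple.

    Returns (extended sequence, sorted list of competing links). The list
    is empty when the walk ran off the end of the reads.
    """
    sequence, current, seen = decode(seed, k), seed, {seed}
    while True:
        links = {link for _, _, link in table.get(current, []) if link != -1}
        if len(links) != 1:
            return sequence, sorted(links)  # read end or junction
        current = links.pop()
        if current in seen:
            return sequence, [current]  # guard against walking in a loop
        seen.add(current)
        sequence += decode(current, k)[-1]
```

Starting from ATG (E = 7) the walk stops after one step, because TGC links to both GCA (45) and GCT (46). Starting from AGG (E = 11) all reads agree and the walk runs to the end of the reads.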
Repeat Sequence and Repeat Graph
  [Figure: reads, repeat sequence, repeat graph]

Assembly Strategy
  Genome/Chromosome
  Forward-reverse paired reads, known dist ~500 bp
  Extend Solexa reads to long reads of 1-2 kb
  Capillary reads assembler: Phrap/Phusion
Kmer Extension & Walk
Quality Filters on Junctions
True Repeat Junctions
All Low Base Quality Case
Repetitive Contig and Read Pairs Depth
  For each hit read in the contig, the contig index and offset are stored.
  [Figure labels: contig start, current read position, pair read position, insert length, depth]
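One way to use the stored contig index and offset is to count read pairs whose mates land on the same contig at about the expected separation. The Python sketch below is only an assumption about how such a check could look: the dictionary layout, the tolerance parameter, the treatment of the offset difference as the insert length, and the name pair_support are all hypothetical.

```python
def pair_support(placements, pairs, contig, insert_length, tolerance):
    """Count read pairs that support `contig`.

    placements: {read name: (contig index, offset)} for every hit read.
    pairs: iterable of (read name, mate name).
    A pair counts when both mates are placed on `contig` and their offsets
    differ by insert_length within +/- tolerance (assumed criterion).
    """
    count = 0
    for read, mate in pairs:
        if read not in placements or mate not in placements:
            continue
        read_contig, read_offset = placements[read]
        mate_contig, mate_offset = placements[mate]
        if read_contig != contig or mate_contig != contig:
            continue
        separation = abs(mate_offset - read_offset)
        if abs(separation - insert_length) <= tolerance:
            count += 1
    return count
```

A low count for a contig that many reads hit would be one hint that the contig is repetitive, since mates of reads in a repeat tend to scatter across the copies.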
Read Pairs to Resolve Repeat Junctions
Handling of Repeat Junctions
Handling of Single Base Variations
S. suis P1/7 Solexa Assembly
  Solexa reads:
    Number of reads: 3,084,185
    Finished genome size: 2,007,491 bp
    Read length: 39 and 36 bp
    Estimated read coverage: ~40X
    Estimated Kmer coverage: 14X
    Number of vector reads: ?
  Assembly features - contig stats:
    Total number of contigs: 362
    Total bases of contigs: 1,938,732 bp
    N50 contig size: 10,849
    Largest contig: 33,388
    Average contig size: 5,356
    Contig coverage over the genome: ~97 %
    Contig extension errors: 1
    Mis-assembly errors: 3
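The contig statistics reported on the result slides can be computed from a list of contig lengths. The following is a minimal Python sketch; n50 follows the usual definition of N50, and contig_stats is an illustrative helper. Its genome_fraction is simply total bases divided by genome size, which is an assumption about how "contig coverage over the genome" might be approximated, not a statement of how the slide's value was obtained.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def contig_stats(lengths, genome_size):
    """Summary statistics in the style of the assembly result slides."""
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bases": total,
        "n50": n50(lengths),
        "largest": max(lengths),
        "average": total / len(lengths),
        # Assumed approximation: assembled bases over finished genome size.
        "genome_fraction": total / genome_size,
    }
```

As a consistency check on the slide, 1,938,732 bases in 362 contigs gives an average of about 5,356, the value reported.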
S. suis P1/7 Shredded Read Assembly
  Shredded reads:
    Number of reads: 1,338,161
    Finished genome size: 2,007,491 bp
    Read length: 36
    Estimated read coverage: 24X
    Insert size: 500 bp
  Assembly features:
                              Paired_Data     Not_Paired
    Number of contigs:        [missing]       [missing]
    Total assembled bases:    [missing] Mb    1.956 Mb
    N50 contig size:          243,039         13,929
    Largest contig:           474,070         33,460
    Average contig size:      57,043          6,168
    Contig coverage:          >99.0 %         >99.0 %
    Contig extension errors:  0               0
    Mis-assembly errors:      3               2
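The estimated read coverage follows from the number of reads, the read length and the genome size. A small Python sketch of that arithmetic (read_coverage is an illustrative name, and the formula assumes all reads have the same length):

```python
def read_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base: N * L / G."""
    return n_reads * read_length / genome_size
```

For the shredded S. suis data, read_coverage(1338161, 36, 2007491) comes out at about 24, in line with the 24X on the slide.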
S. Typhi 6979 Solexa Assembly
  Solexa reads:
    Number of reads: 5,142,190
    Finished genome size: 4,809,037 bp
    Read length: 41
    Estimated read coverage: ~15X
  Assembly features - contig stats:
    Total number of contigs: 3,126
    Total bases of contigs: 4,633,241 bp
    N50 contig size: 2,460
    Largest contig: 15,325
    Average contig size: 1,482
    Contig coverage over the genome: ~97.5 %
    Mis-assembly errors: 0

S. Typhi CT18 Shredded Read Assembly
  Solexa reads:
    Number of reads: 4,808,788
    Finished genome size: 4,809,037 bp
    Read length: 40
    Estimated read coverage: 40X
  Assembly features - contig stats:
    Total number of contigs: 65
    Total bases of contigs: 4,800,992 bp
    N50 contig size: 158,460
    Largest contig: 489,849
    Average contig size: 73,861
    Contig coverage over the genome: ~99.0 %
    Mis-assembly errors: 3
PF_3D7 Shredded Read Assembly
  Solexa reads:
    Number of reads: 11,630,428
    Finished genome size: 23.5 Mb
    Read length: 40
    Estimated read coverage: 20X
  Assembly features - contig stats:
    Total number of contigs: 29,313
    Total bases of contigs: [missing] Mb
    N50 contig size: 1,355
    Largest contig: 14,136
    Average contig size: 585
    Contig coverage over the genome: ~72.8 %
    Mis-assembly errors: ?
Clone Level Assembly with Shredded Error Free Reads
  Genome/Chromosome
  Shred reads with given coverage: forward-reverse paired reads, ~40 bp each, known dist ~500 bp
  Organize reads into small groups covering a clone of 200 kb
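Shredding a finished sequence into error-free paired reads can be sketched as follows. This is a minimal Python illustration under assumptions not given on the slide: fragment starts are drawn uniformly at random, every fragment has exactly the nominal insert size, and the reverse read is the reverse complement of the fragment's far end. The function names and default values are illustrative.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def shred(genome, read_len=40, insert=500, coverage=40, seed=0):
    """Sample error-free forward-reverse read pairs from `genome`.

    Returns a list of (fragment start, forward read, reverse read).
    The number of pairs is chosen so that total read bases are roughly
    coverage * len(genome).
    """
    rng = random.Random(seed)
    n_pairs = coverage * len(genome) // (2 * read_len)
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((start, forward, reverse))
    return pairs
```

Grouping the pairs by clone could then be as simple as bucketing on start // 200000, again as an assumption about how the 200 kb groups might be formed.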
Human Chromosome X
  Shredded reads:
    Number of reads: 156 million
    Chromosome length: 156 Mb
    Number of clones: 774
    Read length: 40
    Estimated read coverage: 40X
  Assembly features - contig stats:
    Total number of contigs: 28,204
    Total bases of contigs: 148 Mb
    N50 contig size: 30,968
    Largest contig: 173,157
    Average contig size: 5,254

Zebrafish Chromosome 5
  Shredded reads:
    Number of reads: 70.2 million
    Chromosome length: 70.3 Mb
    Number of clones: 351
    Read length: 40
    Estimated read coverage: 40X
  Assembly features - contig stats:
    Total number of contigs: 22,405
    Total bases of contigs: 67.5 Mb
    N50 contig size: 9,587
    Largest contig: 70,757
    Average contig size: 3,012

Plasmodium Chr14
  Shredded reads:
    Number of reads: 3.2 million
    Chromosome length: 3.29 Mb
    Number of clones: 16
    Read length: 40
    Estimated read coverage: 40X
  Assembly features - original data:
    Total number of contigs: 1,960
    Total bases of contigs: 2.86 Mb
    N50 contig size: 2,924
    Largest contig: 18,366
    Average contig size: 1,461
  Assembly features - replacing "TATATA…":
    Total number of contigs: 1,333
    Total bases of contigs: 3.05 Mb
    N50 contig size: 4,596
    Largest contig: 23,345
    Average contig size: 2,287
Acknowledgements:
  Ian Goodhead and Chris Clee
  James Bonfield
  Yong Gu and Adam Spargo
  Daniel Zerbino (EBI)
  Tony Cox
  Richard Durbin