Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute.

2 Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute

3 Outline of the Talk:
- Euler Path and Sequence Reconstruction
- Euler Hash Table
- Read Extension
- Using Base Qualities and Read Pairs
- Repeat Junctions and Single Base Variation
- Assembly Results
- Future Work

4 Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
Path: ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
Vertices: k-tuples from the spectrum (8), shown in red on the slide.
Edges: overlapping k-tuples (7).
Path: visits all vertices, spelling out the sequence.
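The Hamiltonian path formulation on this slide can be illustrated with a small sketch. It is only an illustration for the slide's 10-base example, not the method used by the assembler in this talk; the function names (`spectrum`, `overlap_edges`, `hamiltonian_path`, `spell`) are made up for this example, and the brute-force search would not scale to real data.

```python
def spectrum(seq, k):
    """Return the distinct k-tuples of seq in order of first appearance."""
    return list(dict.fromkeys(seq[i:i + k] for i in range(len(seq) - k + 1)))


def overlap_edges(kmers):
    """Edges u -> v where the last k-1 bases of u equal the first k-1 of v."""
    return [(u, v) for u in kmers for v in kmers
            if u != v and u[1:] == v[:-1]]


def hamiltonian_path(kmers):
    """Depth-first search for a path that visits every vertex exactly once."""
    edges = overlap_edges(kmers)
    successors = {u: [v for (a, v) in edges if a == u] for u in kmers}

    def walk(path, seen):
        if len(path) == len(kmers):
            return path
        for v in successors[path[-1]]:
            if v not in seen:
                found = walk(path + [v], seen | {v})
                if found:
                    return found
        return None

    for start in kmers:
        found = walk([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Spell the sequence: first k-tuple plus the last base of each later one."""
    return path[0] + "".join(v[-1] for v in path[1:])


kmers = spectrum("ATGCAGGTCC", 3)
print(len(kmers), len(overlap_edges(kmers)))   # 8 vertices, 7 edges
print(spell(hamiltonian_path(kmers)))          # ATGCAGGTCC
```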

5 Sequence Reconstruction - Euler path approach
Vertices: correspond to (k-1)-tuples (7): AT, GT, CG, CA, GC, TG, GG.
Edges: correspond to k-tuples from the spectrum (8).
Path: visits all edges, spelling out the sequence.
Sequences shown: ATGCGTGGCA and ATGGCGTGCA
Path for ATGGCGTGCA: ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
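As a rough sketch of the Euler path idea, the k-tuples can be turned into edges between (k-1)-tuples and traversed with Hierholzer's algorithm. The algorithm choice and the names `euler_path` and `spell` are assumptions made for this example; the slide does not say how the path is found. For the slide's spectrum two sequences use every edge exactly once, and which one comes out depends on the order in which edges are visited.

```python
from collections import defaultdict


def euler_path(kmers):
    """Return the (k-1)-tuple vertices along a path that uses every k-tuple once."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])   # edge: prefix -> suffix
        indegree[kmer[1:]] += 1

    # Start where out-degree exceeds in-degree by one, else at the first prefix.
    start = next((node for node in list(graph)
                  if len(graph[node]) - indegree[node] == 1), kmers[0][:-1])

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit vertex
    path.reverse()
    return path


def spell(path):
    """Spell the sequence from consecutive overlapping (k-1)-tuples."""
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(spell(euler_path(kmers)))   # one of ATGCGTGGCA or ATGGCGTGCA
```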

6 SSAHA Type Hash Table
S1=(ATGCAGGTCC), S2=(ATCAGGTCC), S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)
Indices, offsets and links to the next, listed per read as index,offset,link:

E   k-tuple  S1       S2       S3       S4
7   ATG      1,1,28            3,1,28   4,1,28
8   ATC               2,1,29
10  AGT                                 4,5,38
11  AGG      1,5,42   2,4,42   3,6,42
19  TAG                        3,5,11
24  TTC                                 4,7,32
28  TGC      1,2,45            3,2,46   4,2,45
29  TCA               2,2,51
32  TCC      1,8,-1   2,7,-1   3,9,-1   4,8,-1
38  GTT                                 4,6,24
40  GTC      1,7,32   2,6,32   3,8,32
42  GGT      1,6,40   2,5,40   3,7,40
45  GCA      1,3,51                     4,3,51
46  GCT                        3,3,53
51  CAG      1,4,11   2,3,11            4,4,10
53  CTA                        3,4,19

7 Point to the Next - Hash Table Links
S1=(ATGCAGGTCC), S2=(ATCAGGTCC), S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)
The table is the same as on slide 6. The third value of each entry is the link: the E value of the k-tuple that follows in the same read, with -1 marking the last k-tuple of a read.
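The linked hash table on slides 6 and 7 can be rebuilt with a few lines of code. This is a sketch under assumptions: the base encoding (A=0, T=1, G=2, C=3, base-4 value plus 1) is inferred from the E values listed in the table, and the names `encode` and `build_table` are hypothetical rather than taken from the talk.

```python
# Base-to-digit mapping inferred from the E values on the slide (an assumption).
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(kmer):
    """Base-4 value of the k-tuple, plus 1, matching the E column."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value + 1


def build_table(reads, k=3):
    """Map E -> list of (read index, offset, link to next E); link is -1 at read end."""
    table = {}
    for index, read in enumerate(reads, start=1):
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for offset, kmer in enumerate(kmers, start=1):
            # kmers[offset] is the next k-tuple because offset is 1-based.
            link = encode(kmers[offset]) if offset < len(kmers) else -1
            table.setdefault(encode(kmer), []).append((index, offset, link))
    return table


reads = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
table = build_table(reads)
print(table[encode("TGC")])   # [(1, 2, 45), (3, 2, 46), (4, 2, 45)]
```

Following the link values from any entry walks along the read one k-tuple at a time, which is what makes the table usable for extension.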

8 Repeat Sequence and Repeat Graph (figure; label: reads)

9 Assembly Strategy
- Genome/Chromosome
- Forward-reverse paired reads: 30-40 bp at each end, known distance ~500 bp
- Extend Solexa reads to long reads of 1-2 Kb
- Capillary reads assembler: Phrap/Phusion

10 Kmer Extension & Walk
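The slide itself carries no algorithmic detail, so the following is only a minimal sketch of one common form of k-mer extension: keep appending the single base that follows the current last k-mer, and stop at a dead end or at a junction where more than one base is possible. The names `successor_table` and `extend` and the `max_len` cap are assumptions for this example; base qualities and read pairs, which the talk uses at junctions, are not modelled here.

```python
from collections import defaultdict


def successor_table(reads, k):
    """Map each k-mer to the set of bases observed immediately after it."""
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            successors[read[i:i + k]].add(read[i + k])
    return successors


def extend(seed, successors, k, max_len=2000):
    """Walk from the seed while the next base is unambiguous."""
    seq = seed
    seen = {seq[-k:]}
    while len(seq) < max_len:
        bases = successors.get(seq[-k:], set())
        if len(bases) != 1:
            break                      # dead end (0) or junction (>1)
        seq += next(iter(bases))
        if seq[-k:] in seen:
            break                      # guard against walking in a loop
        seen.add(seq[-k:])
    return seq


# Overlapping reads from one sequence extend all the way through.
clean = successor_table(["ATGCAG", "GCAGGT", "AGGTCC"], 3)
print(extend("ATG", clean, 3))         # ATGCAGGTCC

# The four example sequences from slide 6 disagree after TGC (A or T),
# so the walk stops at that junction.
mixed = successor_table(["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"], 3)
print(extend("ATG", mixed, 3))         # ATGC
```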

11 Quality Filters on Junctions

12 True Repeat Junctions

13 All Low Base Quality Case

14 Repetitive Contig and Read Pairs Depth
For each hit read in the contig, the contig index and offset are stored.
Figure labels: contig start, current read position, pair read position, insert length, depth.
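One way to picture the stored contig index and offset is as a lookup used to count how many reads at the current position have a mate already placed in the same contig within one insert length. The sketch below is an assumption about how such a check could look, not the talk's implementation; `pair_support`, `placed`, and `mate_of` are hypothetical names.

```python
def pair_support(candidates, mate_of, placed, contig, position, insert_len):
    """Count candidate reads whose mate lies in `contig` within insert_len behind `position`.

    placed maps read id -> (contig index, offset); mate_of maps read id -> mate id.
    """
    support = 0
    for read in candidates:
        hit = placed.get(mate_of.get(read))
        if hit is None:
            continue                          # mate not placed in any contig yet
        mate_contig, mate_offset = hit
        if mate_contig == contig and 0 < position - mate_offset <= insert_len:
            support += 1
    return support


placed = {"a/2": (0, 100), "b/2": (0, 900), "c/2": (1, 120)}
mate_of = {"a/1": "a/2", "b/1": "b/2", "c/1": "c/2", "d/1": "d/2"}
print(pair_support(["a/1", "b/1", "c/1", "d/1"], mate_of, placed,
                   contig=0, position=600, insert_len=500))   # 1
```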

15 Read Pairs to Resolve Repeat Junctions

17 Handling of Repeat Junctions

18 Handling of Single Base Variations

19 S. suis P1/7 Solexa Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~40X
- Estimated Kmer coverage: 14X
- Number of vector reads: ?
Assembly features (contig stats):
- Total number of contigs: 362
- Total bases of contigs: 1,938,732 bp
- N50 contig size: 10,849
- Largest contig: 33,388
- Average contig size: 5,356
- Contig coverage over the genome: ~97 %
- Contig extension errors: 1
- Mis-assembly errors: 3
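For readers unfamiliar with the contig statistics used on the results slides, here is a small sketch of the usual N50 definition (the length of the contig at which the running total of lengths, taken from longest to shortest, reaches half of all assembled bases) and of the average contig size. The function name `n50` is made up for this example; the talk does not state which exact N50 convention was used.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


print(n50([2, 3, 4, 5, 6]))        # 5: 6 + 5 = 11 covers half of 20

# Average contig size from the totals on this slide.
print(round(1_938_732 / 362))      # 5356
```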

20 S. suis P1/7 Shredded Read Assembly
Shredded reads:
- Number of reads: 1,338,161
- Finished genome size: 2,007,491 bp
- Read length: 36
- Estimated read coverage: 24X
- Insert size: 500 bp
Assembly features:
                          Paired_Data   Not_Paired
Number of contigs         35            317
Total assembled bases     1.996 Mb      1.956 Mb
N50 contig size           243,039       13,929
Largest contig            474,070       33,460
Average contig size       57,043        6,168
Contig coverage           >99.0 %       >99.0 %
Contig extension errors   0             0
Mis-assembly errors       3             2

21 S. Typhi 6979 Solexa Assembly
Solexa reads:
- Number of reads: 5,142,190
- Finished genome size: 4,809,037 bp
- Read length: 41
- Estimated read coverage: ~15X
Assembly features (contig stats):
- Total number of contigs: 3,126
- Total bases of contigs: 4,633,241 bp
- N50 contig size: 2,460
- Largest contig: 15,325
- Average contig size: 1,482
- Contig coverage over the genome: ~97.5 %
- Mis-assembly errors: 0

23 S. Typhi CT18 Shredded Read Assembly
Solexa reads:
- Number of reads: 4,808,788
- Finished genome size: 4,809,037 bp
- Read length: 40
- Estimated read coverage: 40X
Assembly features (contig stats):
- Total number of contigs: 65
- Total bases of contigs: 4,800,992 bp
- N50 contig size: 158,460
- Largest contig: 489,849
- Average contig size: 73,861
- Contig coverage over the genome: ~99.0 %
- Mis-assembly errors: 3

24 PF_3D7 Shredded Read Assembly
Solexa reads:
- Number of reads: 11,630,428
- Finished genome size: 23.5 Mb
- Read length: 40
- Estimated read coverage: 20X
Assembly features (contig stats):
- Total number of contigs: 29,313
- Total bases of contigs: 17.17 Mb
- N50 contig size: 1,355
- Largest contig: 14,136
- Average contig size: 585
- Contig coverage over the genome: ~72.8 %
- Mis-assembly errors: ?

25 Clone Level Assembly with Shredded Error Free Reads
- Genome/Chromosome
- Shred reads with given coverage
- Forward-reverse paired reads: ~40 bp at each end, known distance ~500 bp
- Organize reads into small groups, each covering a clone of 200 kb
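Shredding a finished sequence into error-free paired reads can be sketched as below. This is an illustration under assumptions, not the script used for the talk: fragment starts are drawn uniformly at random, the insert is taken as the full fragment length, the second read is reported as the reverse complement of the fragment's far end, and the names `shred` and `reverse_complement` are hypothetical. The number of reads follows from coverage x genome length / read length.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def shred(genome, coverage, read_len=40, insert=500, seed=0):
    """Return error-free (forward, reverse) read pairs at roughly the given coverage."""
    rng = random.Random(seed)
    n_pairs = coverage * len(genome) // (2 * read_len)   # two reads per pair
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((forward, reverse))
    return pairs


# Toy 2 kb "genome"; a real run would read the finished sequence from a file.
toy = "".join(random.Random(1).choice("ACGT") for _ in range(2000))
pairs = shred(toy, coverage=40)
print(len(pairs) * 2)   # 2000 reads of 40 bp, i.e. 40X over 2 kb
```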

26 Human Chromosome X
Shredded reads:
- Number of reads: 156 million
- Chromosome length: 156 Mb
- Number of clones: 774
- Read length: 40
- Estimated read coverage: 40X
Assembly features (contig stats):
- Total number of contigs: 28,204
- Total bases of contigs: 148 Mb
- N50 contig size: 30,968
- Largest contig: 173,157
- Average contig size: 5,254

27 Zebrafish Chromosome 5
Shredded reads:
- Number of reads: 70.2 million
- Chromosome length: 70.3 Mb
- Number of clones: 351
- Read length: 40
- Estimated read coverage: 40X
Assembly features (contig stats):
- Total number of contigs: 22,405
- Total bases of contigs: 67.5 Mb
- N50 contig size: 9,587
- Largest contig: 70,757
- Average contig size: 3,012

28 Plasmodium Chr14
Shredded reads:
- Number of reads: 3.2 million
- Chromosome length: 3.29 Mb
- Number of clones: 16
- Read length: 40
- Estimated read coverage: 40X
Assembly features (original data):
- Total number of contigs: 1,960
- Total bases of contigs: 2.86 Mb
- N50 contig size: 2,924
- Largest contig: 18,366
- Average contig size: 1,461
Assembly features (replacing “TATATA…”):
- Total number of contigs: 1,333
- Total bases of contigs: 3.05 Mb
- N50 contig size: 4,596
- Largest contig: 23,345
- Average contig size: 2,287

29 Acknowledgements:
- Ian Goodhead and Chris Clee
- James Bonfield
- Yong Gu and Adam Spargo
- Daniel Zerbino (EBI)
- Tony Cox
- Richard Durbin

