Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities
Zemin Ning, The Wellcome Trust Sanger Institute



Outline of the Talk:
- Euler Path and Sequence Reconstruction
- Euler Hash Table
- Read Extension
- Using Base Qualities and Read Pairs
- Repeat Junctions and Single Base Variation
- Assembly Results
- Future Work

Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
Spectrum (k = 3): ATG AGG TGC TCC GTC GGT GCA CAG
Vertices: k-tuples from the spectrum (8; shown in red on the slide)
Edges: overlapping k-tuples (7)
Path: visits all vertices, corresponding to the sequence: ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC

Sequence Reconstruction - Euler path approach
Vertices: correspond to (k-1)-tuples (7): AT GT CG CA GC TG GG
Edges: correspond to k-tuples from the spectrum (8)
Path: visits all EDGES, corresponding to the sequence: ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
Sequences: ATGCGTGGCA, ATGGCGTGCA
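The Euler path approach on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation behind the talk: the function names `de_bruijn_graph` and `euler_sequence` are hypothetical, and Hierholzer's algorithm is used here as one standard way of finding a path that visits every edge once. When the graph admits more than one Euler path, as it does for this spectrum, the sketch returns just one of them.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Each k-tuple becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def euler_sequence(kmers):
    """Spell a sequence that uses every k-tuple (edge) exactly once."""
    graph = de_bruijn_graph(kmers)
    in_degree = defaultdict(int)
    for successors in graph.values():
        for vertex in successors:
            in_degree[vertex] += 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next(
        (v for v in graph if len(graph[v]) - in_degree[v] == 1),
        next(iter(graph)),
    )
    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    unused = {v: list(successors) for v, successors in graph.items()}
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if unused.get(vertex):
            stack.append(unused[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(vertex[-1] for vertex in path[1:])


spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(euler_sequence(spectrum))
```

For the spectrum above the result is one of the two sequences listed on the slide. Which one comes out depends on the order in which branches are taken, and that ambiguity is what read pairs and base qualities are later used to resolve.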

SSAHA Type Hash Table / Point to the Next - Hash Table Links
S1=(ATGCAGGTCC), S2=(ATCAGGTCC), S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)

Each entry is: sequence index, offset, link to the next k-tuple (-1 marks the last k-tuple of a sequence).

E   k-tuple  Indices, Offsets and links to the next
7   ATG      1,1,28             3,1,28   4,1,28
8   ATC               2,1,29
10  AGT                                  4,5,38
11  AGG      1,5,42   2,4,42   3,6,42
19  TAG                        3,5,11
24  TTC                                  4,7,32
28  TGC      1,2,45            3,2,46   4,2,45
29  TCA               2,2,51
32  TCC      1,8,-1   2,7,-1   3,9,-1   4,8,-1
38  GTT                                  4,6,24
40  GTC      1,7,32   2,6,32   3,8,32
42  GGT      1,6,40   2,5,40   3,7,40
45  GCA      1,3,51                      4,3,51
46  GCT                        3,3,53
51  CAG      1,4,11   2,3,11            4,4,10
53  CTA                        3,4,19
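As a rough illustration of how such a table could be built, the sketch below indexes every 3-tuple of S1 to S4 and stores, for each occurrence, the sequence index, the offset, and the E value of the k-tuple that follows it. The base ordering A=0, T=1, G=2, C=3 and the 1-based E value are assumptions inferred from the numbers on the slide (for example ATG = 7 and TGC = 28); the names `encode` and `build_table` are hypothetical.

```python
# Assumed base order, inferred from the E values shown on the slide.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(kmer):
    """Map a k-tuple to its E value (base-4 number, shifted to start at 1)."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value + 1


def build_table(sequences, k=3):
    """Return {E: [(sequence index, offset, E of next k-tuple or -1), ...]}."""
    table = {}
    for index, seq in enumerate(sequences, start=1):
        n_kmers = len(seq) - k + 1
        for offset in range(1, n_kmers + 1):
            kmer = seq[offset - 1:offset - 1 + k]
            # The link points at the k-tuple starting one base further on.
            link = encode(seq[offset:offset + k]) if offset < n_kmers else -1
            table.setdefault(encode(kmer), []).append((index, offset, link))
    return table


reads = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
table = build_table(reads)
for e_value in sorted(table):
    print(e_value, table[e_value])
```

Following the links from an entry walks along one sequence without rescanning it. A k-tuple whose entries carry different links (TGC links to both 45 and 46) marks a point where the sequences diverge.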

Repeat Sequence and Repeat Graph (figure: reads)

Assembly Strategy
Genome/Chromosome
Forward-reverse paired reads: bp, known dist ~500 bp, bp
Extend Solexa reads to long reads of 1-2 Kb
Capillary reads assembler: Phrap/Phusion

Kmer Extension & Walk
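The slide itself is a figure, so the following is only a guess at the general idea named in its title and in the outline (read extension using base qualities). It is a minimal sketch under stated assumptions: k-mers are counted only when every base passes a quality threshold, and a seed is extended one base at a time while exactly one next k-mer is supported. The names `count_kmers` and `extend_right` and the parameters `min_qual`, `min_count`, and `max_len` are all hypothetical and not taken from the talk.

```python
from collections import Counter


def count_kmers(reads, k, min_qual=20):
    """Count k-mers whose bases all have quality >= min_qual (assumed filter)."""
    counts = Counter()
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            if min(quals[i:i + k]) >= min_qual:
                counts[seq[i:i + k]] += 1
    return counts


def extend_right(seed, counts, k, min_count=1, max_len=2000):
    """Walk to the right while exactly one next k-mer is supported."""
    contig = seed
    seen = {seed[-k:]}
    while len(contig) < max_len:
        suffix = contig[-(k - 1):]
        candidates = [b for b in "ACGT" if counts[suffix + b] >= min_count]
        if len(candidates) != 1:
            break  # dead end, or a junction that needs more evidence
        next_kmer = suffix + candidates[0]
        if next_kmer in seen:
            break  # avoid looping around a repeat
        seen.add(next_kmer)
        contig += candidates[0]
    return contig


reads = [
    ("ATGCAG", [30] * 6),
    ("GCAGGT", [30] * 6),
    ("AGGTCC", [30] * 6),
    ("AGGTAA", [30, 30, 30, 30, 5, 5]),  # low-quality tail
]
print(extend_right("ATG", count_kmers(reads, k=3), k=3))
```

In this toy data the low-quality tail of the last read would create a false branch after GGT. With the quality filter on, that branch never enters the k-mer counts and the walk continues through it.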

Quality Filters on Junctions

True Repeat Junctions

All Low Base Quality Case

Repetitive Contig and Read Pairs
For each hit read in the contig, contig index and offset are stored.
Figure labels: Depth; Insert length; Current read position; Contig start; Pair read position.
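One way to picture the stored contig index and offset being used is sketched below. This is a sketch under assumptions, not the method from the talk: `placements` is a hypothetical dictionary from read name to (contig index, offset), and a mate is taken to support the current position when it sits in the same contig roughly one insert length upstream. The tolerance `tol` is likewise an invented parameter.

```python
def mate_supports(placements, mate_id, contig, position, insert_len, tol=0.25):
    """True if the mate is placed in `contig` about one insert length upstream."""
    placed = placements.get(mate_id)
    if placed is None:
        return False  # mate has not been placed in any contig yet
    mate_contig, mate_offset = placed
    if mate_contig != contig:
        return False
    distance = position - mate_offset
    return (1 - tol) * insert_len <= distance <= (1 + tol) * insert_len


def count_support(placements, mate_ids, contig, position, insert_len):
    """Number of candidate reads whose mates agree with the insert length."""
    return sum(
        mate_supports(placements, mate_id, contig, position, insert_len)
        for mate_id in mate_ids
    )


# Hypothetical placements: read name -> (contig index, offset).
placements = {"r1/1": (0, 100), "r2/1": (0, 900), "r3/1": (1, 100)}
print(count_support(placements, ["r1/1", "r2/1", "r3/1"], 0, 600, 500))
```

At a junction, the branch whose reads have more mates at the expected distance would be the more plausible continuation.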

Read Pairs to Resolve Repeat Junctions

Handling of Repeat Junctions

Handling of Single Base Variations

S Suis P1/7 Solexa Assembly
Solexa reads:
  Number of reads: 3,084,185
  Finished genome size: 2,007,491 bp
  Read length: 39 and 36 bp
  Estimated read coverage: ~40X
  Estimated Kmer coverage: 14X
  Number of vector reads: ?
Assembly features - contig stats:
  Total number of contigs: 362
  Total bases of contigs: 1,938,732 bp
  N50 contig size: 10,849
  Largest contig: 33,388
  Average contig size: 5,356
  Contig coverage over the genome: ~97 %
  Contig extension errors: 1
  Mis-assembly errors: 3
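The N50 contig size reported on these result slides is not defined in the talk. Assuming the usual definition (the length of the contig at which the running total, taken from the longest contig down, first reaches half of all assembled bases), it can be computed as in this short sketch. The function name `n50` and the example lengths are illustrative only.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs


# Toy example, not data from the slides.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Unlike the average contig size, N50 is weighted toward long contigs, which is why the two figures on a slide can differ by a large factor.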

S Suis P1/7 Shredded Read Assembly
Shredded reads:
  Number of reads: 1,338,161
  Finished genome size: 2,007,491 bp
  Read length: 36
  Estimated read coverage: 24X
  Insert size: 500 bp
Assembly features:            Paired_Data   Not_Paired
  Number of contigs:
  Total assembled bases:      Mb            1.956 Mb
  N50 contig size:            243,039       13,929
  Largest contig:             474,070       33,460
  Average contig size:        57,043        6,168
  Contig coverage:            >99.0 %       >99.0 %
  Contig extension errors:    0             0
  Mis-assembly errors:        3             2
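The estimated read coverage on this slide is consistent with the standard relation coverage = number of reads × read length / genome size. A small check, with the hypothetical helper name `read_coverage`:

```python
def read_coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base of the genome."""
    return num_reads * read_length / genome_size


# Figures from the S Suis P1/7 shredded read slide.
coverage = read_coverage(1_338_161, 36, 2_007_491)
print(f"{coverage:.1f}X")
```

The same relation gives about 40X for the STyphi CT18 shredded read set (4,808,788 reads of 40 bp over 4,809,037 bp).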

STyphi 6979 Solexa Assembly
Solexa reads:
  Number of reads: 5,142,190
  Finished genome size: 4,809,037 bp
  Read length: 41
  Estimated read coverage: ~15X
Assembly features - contig stats:
  Total number of contigs: 3,126
  Total bases of contigs: 4,633,241 bp
  N50 contig size: 2,460
  Largest contig: 15,325
  Average contig size: 1,482
  Contig coverage over the genome: ~97.5 %
  Mis-assembly errors: 0

STyphi CT18 Shredded Read Assembly
Solexa reads:
  Number of reads: 4,808,788
  Finished genome size: 4,809,037 bp
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 65
  Total bases of contigs: 4,800,992 bp
  N50 contig size: 158,460
  Largest contig: 489,849
  Average contig size: 73,861
  Contig coverage over the genome: ~99.0 %
  Mis-assembly errors: 3

PF_3D7 Shredded Read Assembly
Solexa reads:
  Number of reads: 11,630,428
  Finished genome size: 23.5 Mb
  Read length: 40
  Estimated read coverage: 20X
Assembly features - contig stats:
  Total number of contigs: 29,313
  Total bases of contigs: Mb
  N50 contig size: 1,355
  Largest contig: 14,136
  Average contig size: 585
  Contig coverage over the genome: ~72.8 %
  Mis-assembly errors: ?

Clone Level Assembly with Shredded Error Free Reads
Genome/Chromosome
Shred reads with given coverage: forward-reverse paired reads, ~40 bp each, known dist ~500 bp
Organize reads into small groups, each covering a 200 kb clone

Human Chromosome X
Shredded reads:
  Number of reads: 156 million
  Chromosome length: 156 Mb
  Number of Clones: 774
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 28,204
  Total bases of contigs: 148 Mb
  N50 contig size: 30,968
  Largest contig: 173,157
  Average contig size: 5,254

Zebrafish Chromosome 5
Shredded reads:
  Number of reads: 70.2 million
  Chromosome length: 70.3 Mb
  Number of Clones: 351
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 22,405
  Total bases of contigs: 67.5 Mb
  N50 contig size: 9,587
  Largest contig: 70,757
  Average contig size: 3,012

Plasmodium Chr14
Shredded reads:
  Number of reads: 3.2 million
  Chromosome length: 3.29 Mb
  Number of Clones: 16
  Read length: 40
  Estimated read coverage: 40X
Assembly features - Original data:
  Total number of contigs: 1,960
  Total bases of contigs: 2.86 Mb
  N50 contig size: 2,924
  Largest contig: 18,366
  Average contig size: 1,461
Assembly features - Replacing “TATATA…”:
  Total number of contigs: 1,333
  Total bases of contigs: 3.05 Mb
  N50 contig size: 4,596
  Largest contig: 23,345
  Average contig size: 2,287

Acknowledgements:
- Ian Goodhead and Chris Clee
- James Bonfield
- Yong Gu and Adam Spargo
- Daniel Zerbino (EBI)
- Tony Cox
- Richard Durbin