Variation Detections and De novo Assemblies from Next-gen Data
Zemin Ning
The Wellcome Trust Sanger Institute



Outline of the Talk:
- Projects before Bioinformatics
- Bioinformatics Projects Involved
- Variation Detection: SNP, Indel, CNVs etc.
- Fuzzypath – short read assembly
- Extremely GC Biased Genomes

Powder Simulation

Hair Dynamics
Genetics and Human Hair Structure
[Figure panels: African, Caucasian, East Asian]

Informatics Projects Involved
- SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
  - ssaha2 – alignment tool for Solexa, 454 and ABI capillary reads
  - ssahaSNP – SNP/indel detection, mainly for ABI capillary reads
  - ssahaEST – EST or cDNA alignment
  - ssaha_SV – structural variation (CNVs) detection
  - ssaha_pileup – SNP/indel detection from next-gen data
- Phusion
  - Development and maintenance of the pipeline
  - Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes
- TraceSearch
  - Public sequence search facility for all the traces
- Fuzzypath
  - Short read assembler

Read mapping by hashing and dynamic programming
[Figure labels: database of subject sequences; FASTQ file with query sequences; banded Smith-Waterman alignment]
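The hashing step named on this slide can be illustrated with a minimal sketch. It is not ssaha2's implementation: the function names, the toy sequences and the choice to index every overlapping k-mer are assumptions made for illustration. The idea shown is that k-mers of the subject are hashed to their positions, each query k-mer votes for a diagonal (subject position minus query position), and the best-supported diagonal is where a banded Smith-Waterman alignment would then be centred.

```python
from collections import defaultdict

def build_kmer_index(subject, k):
    """Hash every k-mer of the subject sequence to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index

def candidate_offsets(query, index, k):
    """Count seed hits per diagonal (subject_pos - query_pos), best first."""
    votes = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            votes[spos - qpos] += 1
    return sorted(votes.items(), key=lambda item: -item[1])

# Toy data (hypothetical): the query is an exact substring of the subject.
subject = "TTGACCGTAGGCTAACGTTAGC"
query = "GTAGGCTAAC"
index = build_kmer_index(subject, 4)
hits = candidate_offsets(query, index, 4)
print(hits[0])  # (diagonal, number of supporting seeds)
```

Restricting the dynamic programming to a narrow band around the winning diagonal is what keeps the alignment step cheap compared with a full Smith-Waterman matrix.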

Pipeline of ssaha_pileup
[Flowchart labels: Sequencing Reads; Reference fasta; Alignment - ssaha2; Ssaha_cigar; Ssaha_pairs (PE); Ssaha_clean (SE); Unique placed cigar read file; Pileup/cons; ssaha_pileup; Ssaha_indel; SNP File; Indel File]

Mapping Score in ssaha2
The read mapping score is used to assess how repetitive the read is in the genome. In the cigar file entry "cigar::50", S_map = 50 is the mapping score, where:
- R = read length;
- S_max = maximum alignment score (Smith-Waterman) of the hits on the genome;
- S_max2 = second best alignment score of the hits on the genome.
Example: a read of 30 bases has a few hits on the genome.
- Best hit: exact match, with S_max = 30;
- Second best hit: one base mismatch, with S_max2 = 29.
The mapping score for this read is S_map = 10.
[Figure labels: Read, Reference]

SNP Confidence Score in ssaha2
The SNP score is calculated as the sum of weighted read mapping scores, combined with base quality. For Solexa reads:
- S_map = read mapping score, from 0 (repeat) to 50 (unique);
- F_q = base quality factor: F_q = 1 if Q >= 30, F_q = 0.5 if Q < 30;
- N = number of reads covering the location.
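The slide's exact formula is not reproduced in the transcript, so the following is only a sketch under a stated assumption: that each read covering the site contributes its mapping score S_map multiplied by the base quality factor F_q, and that the contributions are summed over the N reads. Any normalisation or cap used by ssaha_pileup itself is not shown here, and the function names and read values are hypothetical.

```python
def base_quality_factor(q):
    """F_q as given on the slide: 1 for Q >= 30, otherwise 0.5."""
    return 1.0 if q >= 30 else 0.5

def snp_score(reads):
    """Assumed form: sum over reads of S_map * F_q (not the verified formula)."""
    return sum(s_map * base_quality_factor(q) for s_map, q in reads)

# Hypothetical pileup at one site: (S_map, base quality Q) per read.
reads = [(50, 35), (50, 20), (10, 32)]
score = snp_score(reads)
print(score)
```

Under this assumption a uniquely placed, high-quality read counts five times as much as a read with S_map = 10, which matches the slide's intent of down-weighting repetitive placements.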

Getting Personal with J. Craig Venter and James Watson

Datasets
- Venter: ABI capillary reads
  - Celera: 19,397,599 (55% in pairs)
  - JCVI: 12,541,352 (98% in pairs)
  - Total: 31,938,951 (72% in pairs)
- Watson: 454 GS FLX reads
  - Baylor & Roche: 74,198,831
  - single-end reads with length 150–280 bp
- Chromosome X Illumina reads
  - 140 million paired Solexa reads at ~45x

SNP Results from Three Individuals
[Table columns: Individuals, Count, % dbSNP]
- Venter SNP Calling (Capillary): Homozygous SNPs, Heterozygous SNPs, Total SNPs
- Watson SNP Calling (454): Homozygous SNPs, Heterozygous SNPs, Total SNPs
- X Chromosome SNPs (Solexa): Homozygous SNPs, Heterozygous SNPs, Total SNPs

Detection of Structural Variations
[Figure: sample reads placed against the reference sequence in three cases: deletion, insertion and VNTR]

Structural Variations against NCBI36
[Table columns: Deletion, VNTR, Insertion]
[Table rows: Total number; Maximum length (bp); Minimum length (bp); Average length (bp); Affected bases]

Deletion – Size Distribution

VNTRs – Size Distribution

Indel Detection: P. falciparum 3D7 Simulations
Simulated Solexa reads:
- Number of reads: 25,647,985
- Genome size: 23.0 Mbp
- Read length: 36
- Read coverage: 40x
- Number of uniquely placed PE reads: 24,303,362
- Percentage of placed PE reads: 94.5%
- Number of uniquely placed SE reads: 23,229,651
- Percentage of placed SE reads: 90.6%
Detection results:
- Number of deletions: 5,816
- Number of detected deletions: 5,668 (97.5%)
- Number of false positives: 135 (2.3%)
- Number of insertions: 5,816
- Number of detected insertions: 5,458 (93.8%)
- Number of false positives: 15 (0.26%)
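The coverage and detection-rate figures on this slide follow from simple arithmetic, which the sketch below reproduces. The helper names are illustrative, and coverage is taken as reads times read length divided by genome size, the usual definition rather than one stated on the slide.

```python
def read_coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size

def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole

coverage = read_coverage(25_647_985, 36, 23.0e6)
deletion_rate = percent(5_668, 5_816)
insertion_rate = percent(5_458, 5_816)
print(round(coverage), round(deletion_rate, 1), round(insertion_rate, 1))
```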

Availability
ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
More information: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup-readme

FuzzyPath and Assemblies from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes

Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
Spectrum: ATG AGG TGC TCC GTC GGT GCA CAG
Path: ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
- Vertices: k-tuples from the spectrum (8), shown in red on the slide;
- Edges: overlapping k-tuples (7);
- Path: visits all vertices, corresponding to the sequence.
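The Hamiltonian path formulation can be sketched as a backtracking search over the overlap graph. This is a minimal illustration using the slide's spectrum; the function name is hypothetical, the k-mers are assumed to be distinct, and the exhaustive search is only practical for toy inputs because the general problem is NP-complete.

```python
def hamiltonian_reconstruct(kmers):
    """Find an ordering that visits every k-mer once, each overlapping the next by k-1."""
    def overlaps(a, b):
        return a[1:] == b[:-1]

    def extend(path, remaining):
        if not remaining:
            return path
        for nxt in sorted(remaining):
            if overlaps(path[-1], nxt):
                found = extend(path + [nxt], remaining - {nxt})
                if found:
                    return found
        return None

    for start in sorted(kmers):
        path = extend([start], set(kmers) - {start})
        if path:
            # Spell the sequence: first k-mer plus the last base of each later one.
            return path[0] + "".join(kmer[-1] for kmer in path[1:])
    return None

spectrum = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
sequence = hamiltonian_reconstruct(spectrum)
print(sequence)
```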

Sequence Reconstruction - Euler path approach
- Vertices: correspond to (k-1)-tuples (7): AT GT CG CA GC TG GG;
- Edges: correspond to k-tuples from the spectrum (8);
- Path: visits all EDGES, corresponding to the sequence.
Sequences shown: ATGCGTGGCA and ATGGCGTGCA
Path for ATGGCGTGCA: ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
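The Euler path formulation can be sketched with Hierholzer's algorithm on the graph whose edges are the k-tuples. This is an illustrative sketch, not the code of any assembler named in the talk, and the function name is hypothetical. Both sequences on the slide have the same spectrum, so which of the two is returned depends on the order in which edges are followed.

```python
from collections import defaultdict

def euler_reconstruct(kmers):
    """Spell a sequence from an Eulerian path using each k-mer (edge) exactly once."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]  # edge between the two (k-1)-tuples
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit node
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
sequence = euler_reconstruct(spectrum)
print(sequence)
```

Unlike the Hamiltonian formulation, this runs in time linear in the number of k-mers, which is why the Euler approach scales to real read sets.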

Assembly Strategy
[Figure labels: Genome/Chromosome; forward-reverse paired reads at a known distance of ~500 bp; Solexa reads assembler to extend long reads of 1-2 Kb; capillary reads assembler Phrap/Phusion]

Kmer Extension & Repeat Junctions

Handling of Single Base Variations

Fuzzy Kmers
Number of mismatches between two kmers:
ACGTAACTAACAGTT
ACGTAACTCACAGTT
ACGTAACT ACAGTT
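Counting mismatches between two kmers of equal length is a Hamming distance, sketched below with the two kmers from the slide. The `max_mismatches` threshold and the function names are hypothetical; the slide does not state the tolerance FuzzyPath actually uses.

```python
def mismatches(kmer_a, kmer_b):
    """Hamming distance between two kmers of equal length."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    return sum(a != b for a, b in zip(kmer_a, kmer_b))

def is_fuzzy_match(kmer_a, kmer_b, max_mismatches=1):
    """Treat two kmers as the same if they differ in at most max_mismatches bases."""
    return mismatches(kmer_a, kmer_b) <= max_mismatches

a = "ACGTAACTAACAGTT"
b = "ACGTAACTCACAGTT"
print(mismatches(a, b), is_fuzzy_match(a, b))
```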

Kmer Extension & Repeat Junctions
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference or Sanger reads
[Figure: pileup of other reads (454, Sanger etc.) at a repeat junction; consensus]

Pileup of Solexa and 454 Reads

S. suis P1/7 Solexa/454 Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~55X
- Number of 454 reads: 100,000
- Read coverage of 454: 10X
Assembly features (contig stats):
- Total number of contigs: 73
- Total bases of contigs: 1,999,817 bp
- N50 contig size: 62,508
- Largest contig: 162,190
- Averaged contig size: 27,394
- Contig coverage over the genome: ~99%
- Contig extension errors: 2
- Mis-assembly errors: 3
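The contig statistics reported on these assembly slides can be computed as in the sketch below. N50 is taken in its usual sense: the length of the contig at which the running total of contigs, sorted longest first, reaches half of the assembled bases. The individual contig lengths are not in the slides, so the list used here is hypothetical; only the average is checked against the slide's totals.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

# Hypothetical contig lengths, for illustration only.
example = [2, 3, 4, 5, 6]
print(n50(example))

# Average contig size from the slide's totals: 1,999,817 bp in 73 contigs.
average = 1_999_817 // 73
print(average)
```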

Salmonella senftenberg Solexa Assembly from Pair-End Reads
Solexa reads:
- Number of reads: 6,000,000
- Finished genome size: ~4.8 Mbp
- Read length: 2x37 bp
- Estimated read coverage: ~92.5X
- Insert size: 170/ bp
Assembly features (contig stats):
                                Solexa      454
Total number of contigs:        75          390
Total bases of contigs:         4.80 Mbp    4.77 Mb
N50 contig size:                139,353     25,702
Largest contig:                 395,600     62,040
Averaged contig size:           63,969      12,224
Contig coverage on genome:      ~99.8%      99.4%
Mis-assembly errors:            0           4
Contig extension errors: 0

E. coli strain 042 Assembly
Solexa reads:
- Number of reads: 7,055,348
- Finished genome size: 5.35 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~95X
- Insert size: 170/ bp
Assembly features (contig stats):
- Total number of contigs: 168
- Total bases of contigs: 5.19 Mbp
- N50 contig size: 85,886
- Largest contig: 337,768
- Averaged contig size: 30,886
- Contig coverage over the genome: ~99%
- Contig extension errors: 1
- Mis-assembly errors: 2

Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
- Number of reads: 6,346,317
- Finished genome size: 4.7 Mbp
- Read length: 33 bp
- Estimated read coverage: ~40X
- Shredded reference of SpA: 10X
Assembly features (contig stats):
- Total number of contigs: 66
- Total bases of contigs: 4,615,704 bp
- N50 contig size: 168,793
- Largest contig: 401,700
- Averaged contig size: 69,934
- Contig coverage over the genome: ~98%
- Contig extension errors: 0
- Mis-assembly errors: 2

The Malaria Genome Project

Datasets with Various GC Content
[Table columns: library; organism; read length; Mb sequence generated; genome size (Mb); mean coverage]
Rows (library, organism, read length):
- PCR-free, B. pertussis ST24, 2 x
- PCR-free, E. coli 042, 2 x
- PCR-free, P. falciparum 3D7, 2 x
- PCR-free, B. pertussis ST24, 2 x
- PCR-free, P. falciparum 3D7, 2 x
- PCR-free, E. coli 042, 2 x
- standard-245, P. falciparum 3D7, 2 x
- standard-368, P. falciparum 3D7, 2 x
- standard-851, P. falciparum 3D7, 2 x
- standard-883, P. falciparum clin, 2 x
GC: 68.0%, 50.5%, 19.0%, 50.8%, 19.0%, 68.0%, 19.0%

Malaria 3D7 Assemblies
Solexa reads:                   2x36 bp     2x76 bp
Number of reads:                14.0m       9.77m
Finished genome size:           23 Mbp      23 Mbp
Estimated read coverage:        43x         64x
Insert size:                    170 bp      170 bp
Assembly features:
Total number of contigs:        26,
Total bases of contigs:         19.2 Mbp    21.1 Mb
N50 contig size:
Largest contig:
Averaged contig size:
Contig coverage on genome:      ~83.5%      91.7%
Contig extension errors:        ?           ?
Mis-assembly errors:            ?           ?

Acknowledgements:
- Jim Mullikin
- Tony Cox – Illumina, UK
- Tony Cox – Sanger Institute
- Adam Spargo
- Yong Gu
- Ben Blackburne
- Hannes Ponstingl
- Daniel Turner
- Michael Quail
- Jane Rogers
- Richard Durbin