Variation Detections and De novo Assemblies from Next-gen Data
Zemin Ning, The Wellcome Trust Sanger Institute

Outline of the Talk:
- Projects before Bioinformatics
- Bioinformatics Projects Involved
- Variation Detection: SNPs, indels, CNVs, etc.
- Fuzzypath: short read assembly
- Extremely GC Biased Genomes

Powder Simulation

Hair Dynamics: Genetics and Human Hair Structure
[Figure panels: African, Caucasian, East Asian]

Informatics Projects Involved
- SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
  - ssaha2: alignment tool for Solexa, 454 and ABI capillary reads
  - ssahaSNP: SNP/indel detection, mainly for ABI capillary reads
  - ssahaEST: EST or cDNA alignment
  - ssaha_SV: structural variation (CNVs) detection
  - ssaha_pileup: SNP/indel detection from next-gen data
- Phusion
  - Development and maintenance of the pipeline
  - Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes
- TraceSearch
  - Public sequence search facility for all the traces
- Fuzzypath
  - Short read assembler

Read mapping by hashing and dynamic programming
[Figure labels: database of subject sequences; FASTQ file with query sequences; banded Smith-Waterman alignment]
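The hashing step named on this slide can be illustrated with a generic seed-lookup sketch. This is not the ssaha2 implementation: the function names `build_index` and `best_diagonal`, the toy sequences, and the choice of k = 4 are all assumptions made for illustration. The sketch indexes every k-mer of the subject, looks up each k-mer of the query, and votes for the diagonal (subject position minus query position) with the most seed hits. A banded Smith-Waterman alignment would then be run around that diagonal; that step is not shown here.

```python
from collections import Counter, defaultdict


def build_index(subject, k):
    """Hash every k-mer of the subject to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def best_diagonal(index, query, k):
    """Return (diagonal, seed_count) for the diagonal with the most k-mer hits.

    The diagonal is subject_position - query_position, so an ungapped
    placement of the query keeps all of its seeds on one diagonal.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            votes[spos - qpos] += 1
    return votes.most_common(1)[0] if votes else None


# Toy example (hypothetical data): the query is an exact substring of the subject.
subject = "ACGTTGCAAGGCTTAACCGGATTC"
query = subject[5:17]
index = build_index(subject, 4)
hit = best_diagonal(index, query, 4)
```

In this toy case the query was cut from subject position 5, so the winning diagonal is 5 and every one of its nine 4-mers votes for it.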

Pipeline of ssaha_pileup
[Figure labels: sequencing reads; reference FASTA; alignment with ssaha2; ssaha_cigar; SE; PE; ssaha_pairs; ssaha_clean; uniquely placed cigar read file; pileup/cons; ssaha_indel; ssaha_pileup; SNP file; indel file]

Mapping Score in ssaha2
The read mapping score is used to assess how repetitive a read is in the genome. In the cigar file, an entry such as cigar::50 means the mapping score is S_map = 50.
- R: read length
- S_max: maximum alignment score (Smith-Waterman) of the hits on the genome
- S_max2: second-best alignment score of the hits on the genome
Example: a read of 30 bases has a few hits on the genome. The best hit is an exact match with S_max = 30; the second-best hit has one base mismatch, with S_max2 = 29. The mapping score for this read is S_map = 10.
[Figure: a read and its hits along the reference, each hit labelled with a number]

SNP Confidence Score in ssaha2
The SNP score is calculated as the sum of weighted read mapping scores, combined with base quality. For Solexa reads:
- S_map: read mapping score, from 0 (repeat) to 50 (unique)
- F_q: base quality factor, F_q = 1 if Q >= 30, F_q = 0.5 if Q < 30
- N: number of reads covering the location
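The scoring formula itself is not preserved in this transcript, so the following is only one plausible reading of the description: each read covering the site contributes its mapping score S_map weighted by the base quality factor F_q (taken here as 1 for Q >= 30 and 0.5 otherwise), and the contributions are summed over the N reads. The function names and the example values are hypothetical, and ssaha_pileup may apply further normalisation that is not shown.

```python
def base_quality_factor(q):
    """Assumed weighting: full weight for Q >= 30, half weight otherwise."""
    return 1.0 if q >= 30 else 0.5


def snp_score(reads):
    """Sum S_map * F_q over the reads covering a site.

    `reads` is a list of (mapping_score, base_quality) pairs, one per read.
    This plain sum is an assumption; the slide's formula is not available.
    """
    return sum(s_map * base_quality_factor(q) for s_map, q in reads)


# Hypothetical site covered by three reads.
example = [(50, 35), (50, 20), (10, 30)]
score = snp_score(example)  # 50*1 + 50*0.5 + 10*1
```

Under this reading, a site supported by uniquely mapped, high-quality reads accumulates score quickly, while repeat-derived reads (S_map near 0) contribute almost nothing.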

Getting Personal with J. Craig Venter and James Watson

Datasets
- Venter: ABI capillary reads
  - Celera: 19,397,599 reads, 55% in pairs
  - JCVI: 12,541,352 reads, 98% in pairs
  - Total: 31,938,951 reads, 72% in pairs
- Watson: 454 GS FLX reads
  - Baylor & Roche: 74,198,831 reads
  - Single-end reads with lengths of 150-280 bp
- Chromosome X: Illumina reads
  - 140 million paired Solexa reads at ~45x

SNP Results from Three Individuals

Individuals                          Count       % dbSNP
Venter SNP Calling (Capillary)
  Homozygous SNPs                1,347,806       97.1%
  Heterozygous SNPs              1,857,167       90.9%
  Total SNPs                     3,204,973       93.5%
Watson SNP Calling (454)
  Homozygous SNPs                1,298,309       93.0%
  Heterozygous SNPs              1,767,951       63.9%
  Total SNPs                     3,066,260       76.3%
X Chromosome SNPs (Solexa)
  Homozygous SNPs                   27,708       92.8%
  Heterozygous SNPs                 63,197       81.8%
  Total SNPs                        90,905       85.1%

Detection of Structural Variations
[Figure: sample reads aligned against the reference sequence in three cases: deletion, insertion, and VNTR (copies A' and A''); read positions labelled 1, 1', 2, 2']

Structural Variations against NCBI36

                         Deletion      VNTRs   Insertion
Total number:               2,507      3,775       1,037
Maximum length (bp):       50,000      4,759           -
Minimum length (bp):           20         20           -
Average length (bp):          815        216           -
Affected bases:         2,043,653    817,930           -

                         Deletion       VNTR   Insertion
Total number:               1,389        553         396
Maximum length (bp):       71,832      9,589           -
Minimum length (bp):           20         20           -
Average length (bp):        1,252        270           -
Affected bases:         1,740,162    149,421           -

Deletion – Size Distribution

VNTRs – Size Distribution

Indel Detection: P. falciparum 3D7 Simulations
Simulated Solexa reads:
- Number of reads: 25,647,985
- Genome size: 23.0 Mbp
- Read length: 36
- Read coverage: 40x
- Number of uniquely placed PE reads: 24,303,362
- Percentage of placed PE reads: 94.5%
- Number of uniquely placed SE reads: 23,229,651
- Percentage of placed SE reads: 90.6%
Detection results:
- Number of deletions: 5,816
- Number of detected deletions: 5,668 (97.5%)
- Number of false positives: 135 (2.3%)
- Number of insertions: 5,816
- Number of detected insertions: 5,458 (93.8%)
- Number of false positives: 15 (0.26%)

Availability
- ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
- http://www.sanger.ac.uk/Software/analysis/SSAHA2
More information: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup-readme

FuzzyPath and Assemblies from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes

Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
Spectrum: ATG AGG TGC TCC GTC GGT GCA CAG
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
- Vertices: k-tuples from the spectrum (8), shown in red on the slide
- Edges: overlapping k-tuples (7)
- Path: visits all vertices, corresponding to the sequence
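The Hamiltonian path formulation can be reproduced for the slide's example with a small backtracking search. This is a minimal sketch for toy inputs, assuming each k-tuple occurs once in the sequence; the function names are illustrative and the search is exponential in the worst case, which is the usual argument against this formulation for real data.

```python
def spectrum(seq, k):
    """Return the sorted set of distinct k-tuples occurring in seq."""
    return sorted({seq[i:i + k] for i in range(len(seq) - k + 1)})


def successors(kmers):
    """Map each k-tuple to the k-tuples that overlap it by k-1 bases."""
    return {a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers}


def hamiltonian_path(kmers):
    """Backtracking search for a path that visits every k-tuple exactly once."""
    succ = successors(kmers)

    def extend(path, seen):
        if len(path) == len(kmers):
            return path
        for nxt in succ[path[-1]]:
            if nxt not in seen:
                found = extend(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None

    for start in kmers:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Spell the sequence: first k-tuple plus the last base of each later one."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


kmers = spectrum("ATGCAGGTCC", 3)
edge_count = sum(len(v) for v in successors(kmers).values())
rebuilt = spell(hamiltonian_path(kmers))
```

For S = ATGCAGGTCC with k = 3 this gives 8 vertices and 7 edges, matching the counts on the slide, and the only Hamiltonian path spells the original sequence.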

Sequence Reconstruction - Euler path approach
- Vertices: correspond to (k-1)-tuples (7): AT GT CG CA GC TG GG
- Edges: correspond to k-tuples from the spectrum (8)
- Path: visits all edges, corresponding to the sequence
Sequences shown: ATGCGTGGCA and ATGGCGTGCA
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA (spells ATGGCGTGCA)
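The Euler path formulation can be sketched with Hierholzer's algorithm on the slide's spectrum. This is a generic textbook construction rather than code from any particular assembler, and the function name is illustrative. Each k-tuple becomes an edge from its (k-1)-prefix to its (k-1)-suffix, and a path using every edge once spells a candidate sequence. It assumes the spectrum admits an Eulerian path.

```python
from collections import defaultdict


def eulerian_sequence(kmers):
    """Spell a sequence from an Eulerian path over the (k-1)-tuple graph."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per vertex
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # Start at the vertex with one surplus outgoing edge, if there is one.
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(graph))

    # Hierholzer's algorithm: walk until stuck, then backtrack and splice.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


slide_spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
candidate = eulerian_sequence(slide_spectrum)
```

The two sequences on the slide, ATGCGTGGCA and ATGGCGTGCA, share this spectrum, so either is a valid Eulerian reconstruction; which one the sketch returns depends on the order in which edges are taken. This ambiguity is the kind of repeat-induced uncertainty that read pairs or longer reads are needed to resolve.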

Assembly Strategy
[Figure labels: genome/chromosome; forward-reverse paired reads of 30-70 bp at a known distance of ~500 bp; Solexa reads assembler to extend long reads of 1-2 kb; capillary reads assembler Phrap/Phusion]

Kmer Extension & Repeat Junctions

Handling of Single Base Variations

Fuzzy Kmers: Number of Mismatches between Two Kmers

ACGTAACTAACAGTT   00 01 10 11 00 00 01 11 00 00 01 00 10 11 11
ACGTAACTCACAGTT   00 01 10 11 00 00 01 11 01 00 01 00 10 11 11
ACGTAACT ACAGTT   00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
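The bit patterns on this slide are consistent with a 2-bit base encoding (A = 00, C = 01, G = 10, T = 11), with the third row being the bitwise XOR of the first two. A minimal sketch of counting mismatches this way follows; the function names are illustrative and this is not taken from the Fuzzypath source.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(kmer):
    """Pack a kmer into an integer, two bits per base, first base highest."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def mismatches(kmer_a, kmer_b):
    """Count differing bases between two kmers of equal length."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    k = len(kmer_a)
    diff = encode(kmer_a) ^ encode(kmer_b)
    # Fold each 2-bit pair onto its low bit, so any non-zero pair counts once.
    low_bits = int("01" * k, 2)
    return bin((diff | (diff >> 1)) & low_bits).count("1")


a = "ACGTAACTAACAGTT"
b = "ACGTAACTCACAGTT"
xor_bits = format(encode(a) ^ encode(b), "030b")
pairs = [xor_bits[i:i + 2] for i in range(0, 30, 2)]
count = mismatches(a, b)
```

For the two kmers on the slide, only the ninth pair of the XOR is non-zero (01), giving one mismatch. Folding the pairs before counting matters because a substitution such as A to T flips both bits of a pair but is still a single mismatch.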

Kmer Extension & Repeat Junctions
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads
[Figure: pileup of other reads, such as 454 or Sanger, at a repeat junction, with the consensus]

Pileup of Solexa and 454 Reads

S. suis P1/7 Solexa/454 Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~55x
- Number of 454 reads: 100,000
- Read coverage of 454: 10x
Assembly features (contig stats):
- Total number of contigs: 73
- Total bases of contigs: 1,999,817 bp
- N50 contig size: 62,508
- Largest contig: 162,190
- Averaged contig size: 27,394
- Contig coverage over the genome: ~99%
- Contig extension errors: 2
- Mis-assembly errors: 3
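The contig statistics quoted on the assembly slides can be computed from a list of contig lengths. The sketch below uses the common definition of N50 (the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the assembled bases). The talk does not state which definition was used, so treat that as an assumption; the contig lengths in the example are made up.

```python
def n50(lengths):
    """Length of the contig at which the cumulative size reaches half the total."""
    if not lengths:
        raise ValueError("no contigs")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_contig_size(total_bases, contig_count):
    """Average contig size, truncated to whole bases."""
    return total_bases // contig_count


toy_n50 = n50([10, 8, 6, 4, 2])                  # hypothetical contig lengths
s_suis_mean = mean_contig_size(1_999_817, 73)    # totals quoted for S. suis
```

With the S. suis totals, 1,999,817 bp over 73 contigs gives an average of 27,394 bp, which agrees with the averaged contig size on the slide.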

Salmonella senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
- Number of reads: 6,000,000
- Finished genome size: ~4.8 Mbp
- Read length: 2x37 bp
- Estimated read coverage: ~92.5x
- Insert size: 170/50-300 bp
Assembly features (contig stats):

                               Solexa       454
Total number of contigs:           75       390
Total bases of contigs:      4.80 Mbp  4.77 Mbp
N50 contig size:              139,353    25,702
Largest contig:               395,600    62,040
Averaged contig size:          63,969    12,224
Contig coverage on genome:     ~99.8%     99.4%
Contig extension errors:            0         -
Mis-assembly errors:                0         4

E. coli strain 042 Assembly
Solexa reads:
- Number of reads: 7,055,348
- Finished genome size: 5.35 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~95x
- Insert size: 170/50-300 bp
Assembly features (contig stats):
- Total number of contigs: 168
- Total bases of contigs: 5.19 Mbp
- N50 contig size: 85,886
- Largest contig: 337,768
- Averaged contig size: 30,886
- Contig coverage over the genome: ~99%
- Contig extension errors: 1
- Mis-assembly errors: 2

Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
- Number of reads: 6,346,317
- Finished genome size: 4.7 Mbp
- Read length: 33 bp
- Estimated read coverage: ~40x
- Shredded reference of SpA: 10x
Assembly features (contig stats):
- Total number of contigs: 66
- Total bases of contigs: 4,615,704 bp
- N50 contig size: 168,793
- Largest contig: 401,700
- Averaged contig size: 69,934
- Contig coverage over the genome: ~98%
- Contig extension errors: 0
- Mis-assembly errors: 2

The Malaria Genome Project

Datasets with Various GC Content

library        organism             read length   Mb sequence generated   genome size (Mb)   mean coverage
PCR-free       B. pertussis ST24    2 x 76                          907                4.1             221
PCR-free       E. coli 042          2 x 76                          573                5.3             108
PCR-free       P. falciparum 3D7    2 x 76                         1486               23.0              65
PCR-free       B. pertussis ST24    2 x 36                          452                4.1             110
PCR-free       P. falciparum 3D7    2 x 36                         1008               23.0              44
PCR-free       E. coli 042          2 x 36                          958                5.3             181
standard-245   P. falciparum 3D7    2 x 35                         2198               23.0              96
standard-368   P. falciparum 3D7    2 x 35                         2628               23.0             115
standard-851   P. falciparum 3D7    2 x 35                          474               23.0              21
standard-883   P. falciparum clin   2 x 36                         3994               23.0             175

GC content labels on the slide: 68.0%, 50.5%, 19.0%, 50.8%, 19.0%, 68.0%, 19.0%
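The mean coverage column appears to be the amount of sequence generated divided by the genome size. A short check of that arithmetic follows, reading the first three PCR-free 2 x 76 rows as 907, 573 and 1486 Mb against genomes of 4.1, 5.3 and 23.0 Mb. That parsing of the flattened table is an assumption, as is the function name.

```python
def mean_coverage(sequence_mb, genome_mb):
    """Mean depth of coverage: total sequenced bases over genome size."""
    return sequence_mb / genome_mb


rows = [
    ("B. pertussis ST24", 907, 4.1),
    ("E. coli 042", 573, 5.3),
    ("P. falciparum 3D7", 1486, 23.0),
]
coverages = {name: round(mean_coverage(seq, genome)) for name, seq, genome in rows}
for name, depth in coverages.items():
    print(f"{name}: ~{depth}x")
```

Rounded to whole numbers these come out at 221x, 108x and 65x, in line with the coverage figures listed for those three libraries.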

Malaria 3D7 Assemblies

Solexa reads:                  2x36 bp     2x76 bp
Number of reads:                 14.0m       9.77m
Finished genome size:           23 Mbp      23 Mbp
Estimated read coverage:           43x         64x
Insert size:                    170 bp      170 bp
Assembly features:
Total number of contigs:        26,926      22,839
Total bases of contigs:       19.2 Mbp    21.1 Mbp
N50 contig size:                 1,456       1,621
Largest contig:                  9,106       9,825
Averaged contig size:              706         923
Contig coverage on genome:      ~83.5%       91.7%
Contig extension errors:             ?           ?
Mis-assembly errors:                 ?           ?

Acknowledgements:
- Jim Mullikin
- Tony Cox (Illumina, UK)
- Tony Cox (Sanger Institute)
- Adam Spargo
- Yong Gu
- Ben Blackburne
- Hannes Ponstingl
- Daniel Turner
- Michael Quail
- Jane Rogers
- Richard Durbin
