Variation Detections and De novo Assemblies from Next-gen Data
Zemin Ning
The Wellcome Trust Sanger Institute

Outline of the Talk:
- Projects before Bioinformatics
- Bioinformatics Projects Involved
- Variation Detection: SNP, Indel, CNVs etc.
- Fuzzypath: short read assembly
- Extremely GC Biased Genomes
Powder Simulation
Hair Dynamics: Genetics and Human Hair Structure (figure panels: African, Caucasian, East Asian)
Informatics Projects Involved
SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
- ssaha2: alignment tool for Solexa, 454 and ABI capillary reads
- ssahaSNP: SNP/indel detection, mainly for ABI capillary reads
- ssahaEST: EST or cDNA alignment
- ssaha_SV: structural variation (CNVs) detection
- ssaha_pileup: SNP/indel detection from next-gen data
Phusion
- Development and maintenance of the pipeline
- Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes
TraceSearch
- Public sequence search facility for all the traces
Fuzzypath
- Short read assembler
Read mapping by hashing and dynamic programming
(Figure labels: database of subject sequences; FASTQ file with query sequences; banded Smith-Waterman alignment.)
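The hashing step can be illustrated with a small sketch. This is not the ssaha2 implementation: the function names `build_index` and `best_diagonal`, the k-mer size and the toy sequences are all assumptions made for illustration, and for simplicity every overlapping k-mer of the subject is indexed. Query k-mers are looked up in the hash table and hits are tallied by diagonal (subject position minus query position); the best-supported diagonal is the region where a banded Smith-Waterman alignment would then be run.

```python
from collections import Counter, defaultdict


def build_index(subject, k):
    """Hash every k-mer of the subject to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def best_diagonal(query, index, k):
    """Return (diagonal, number_of_kmer_hits) for the best-supported diagonal.

    A diagonal is subject_position - query_position; hits from one ungapped
    match all fall on the same diagonal. Returns None if nothing matches.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            votes[spos - qpos] += 1
    return votes.most_common(1)[0] if votes else None


# Toy data (hypothetical): the query is an exact copy of subject[7:15].
subject = "GATTACAGGCTTACCGATAGC"
query = "GGCTTACC"
index = build_index(subject, k=4)
print(best_diagonal(query, index, k=4))  # (7, 5)
```

In this toy case the read is placed at subject offset 7 with five supporting k-mers, while a spurious single hit on another diagonal is outvoted.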
Pipeline of ssaha_pileup (flowchart)
- Inputs: sequencing reads, reference fasta
- Alignment: ssaha2
- SE reads: ssaha_cigar; PE reads: ssaha_pairs
- ssaha_clean
- Uniquely placed cigar read file
- Pileup/cons
- ssaha_pileup: SNP file
- ssaha_indel: indel file
Mapping Score in ssaha2
The read mapping score is used to assess how repetitive a read is in the genome. In the cigar file it follows "cigar::"; for example, cigar::50 means the mapping score is S_map = 50.
- R: read length;
- S_max: maximum alignment score (Smith-Waterman) of the hits on the genome;
- S_max2: second best alignment score of the hits on the genome.
Example: a read of 30 bases has a few hits on the genome.
- Best hit: exact match, with S_max = 30;
- Second best hit: one base mismatch, with S_max2 = 29.
The mapping score for this read is S_map = 10.
(Figure: read aligned against the reference.)
SNP Confidence Score in ssaha2
The SNP score is calculated as the sum of weighted read mapping scores, combined with base quality. For Solexa reads:
- S_map: read mapping score, from 0 (repeat) to 50 (unique);
- F_q: base quality factor, F_q = 1 if Q >= 30 and F_q = 0.5 if Q < 30;
- N: number of reads covering the location.
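As a rough illustration of the description above, the following sketch sums the mapping scores of the N reads covering a site, each weighted by its base quality factor. The exact formula used by the tool is not given in this text, so the plain weighted sum, the two-tier quality factor (1 for Q >= 30, 0.5 below) and the function names are assumptions.

```python
def quality_factor(base_quality):
    """Assumed base quality factor: 1 for Q >= 30, otherwise 0.5."""
    return 1.0 if base_quality >= 30 else 0.5


def snp_score(covering_reads):
    """Sum of quality-weighted mapping scores over the reads at one site.

    covering_reads: iterable of (mapping_score, base_quality) pairs, one per
    read covering the site. Assumed form of the score, not the tool's code.
    """
    return sum(s_map * quality_factor(q) for s_map, q in covering_reads)


# Hypothetical site: two uniquely mapped reads and one repeat-mapped read.
reads_at_site = [(50, 35), (50, 20), (0, 40)]
print(snp_score(reads_at_site))  # 75.0
```

Under these assumptions a read placed in a repeat (mapping score 0) contributes nothing, however good its base quality.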
Getting Personal with J. Craig Venter and James Watson
Datasets
- Venter: ABI capillary reads
  - Celera: 19,397,599 (55% in pairs)
  - JCVI: 12,541,352 (98% in pairs)
  - Total: 31,938,951 (72% in pairs)
- Watson: 454 GS FLX reads
  - Baylor & Roche: 74,198,831
  - single end reads with length 150 – 280 bp
- Chromosome X Illumina reads
  - 140 million paired Solexa reads at ~45x
SNP Results from Three Individuals
Table columns: Individuals, Count, % dbSNP (values not shown).
- Venter SNP Calling (Capillary): Homozygous SNPs, Heterozygous SNPs, Total SNPs
- Watson SNP Calling (454): Homozygous SNPs, Heterozygous SNPs, Total SNPs
- X Chromosome SNPs (Solexa): Homozygous SNPs, Heterozygous SNPs, Total SNPs
Detection of Structural Variations
(Figure: sample reads aligned against the reference sequence, illustrating a deletion, an insertion and a VNTR.)

Structural Variations against NCBI36
Table columns: Deletion, VNTR, Insertion (values not shown).
Rows: Total number; Maximum length (bp); Minimum length (bp); Average length (bp); Affected bases.
Deletion – Size Distribution
VNTRs – Size Distribution
Indel Detection: P. falciparum 3D7 Simulations
Simulated Solexa reads:
- Number of reads: 25,647,985
- Genome size: 23.0 Mbp
- Read length: 36
- Read coverage: 40x
- Number of uniquely placed PE reads: 24,303,362
- Percentage of placed PE reads: 94.5%
- Number of uniquely placed SE reads: 23,229,651
- Percentage of placed SE reads: 90.6%
Detection results:
- Number of deletions: 5,816
- Number of detected deletions: 5,668 (97.5%)
- Number of false positives: 135 (2.3%)
- Number of insertions: 5,816
- Number of detected insertions: 5,458 (93.8%)
- Number of false positives: 15 (0.26%)
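The detection percentages quoted for this simulation can be reproduced with simple arithmetic. The sketch below assumes that both the detection rate and the false positive rate are expressed relative to the number of simulated events (5,816), which is consistent with the quoted figures; the helper name `percent_of` is hypothetical.

```python
def percent_of(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


simulated = 5816  # simulated deletions (and, separately, insertions)

print(round(percent_of(5668, simulated), 1))  # detected deletions
print(round(percent_of(135, simulated), 1))   # deletion false positives
print(round(percent_of(5458, simulated), 1))  # detected insertions
print(round(percent_of(15, simulated), 2))    # insertion false positives
```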
Availability
ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
More information: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup-readme
FuzzyPath and Assemblies from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes
Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
Spectrum (k = 3): ATG AGG TGC TCC GTC GGT GCA CAG
Path: ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
- Vertices: k-tuples from the spectrum (8);
- Edges: overlapping k-tuples (7);
- Path: visits all vertices, corresponding to the sequence.
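The Hamiltonian path formulation can be tried out on the slide's own example. The sketch below is a brute-force backtracking search, practical only for toy inputs; the function name is an assumption. Each distinct k-tuple is a vertex, an edge joins two k-tuples when the last k-1 bases of one equal the first k-1 bases of the other, and a path that visits every vertex once spells a candidate sequence.

```python
def hamiltonian_reconstructions(kmers):
    """Return every sequence spelled by a path visiting each k-mer exactly once."""
    vertices = list(dict.fromkeys(kmers))  # distinct k-mers, order preserved
    successors = {
        u: [v for v in vertices if v != u and u[1:] == v[:-1]] for u in vertices
    }
    found = set()

    def visit(path, seen):
        if len(path) == len(vertices):
            # First k-mer in full, then the last base of each following k-mer.
            found.add(path[0] + "".join(v[-1] for v in path[1:]))
            return
        for nxt in successors[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                visit(path, seen)
                path.pop()
                seen.remove(nxt)

    for start in vertices:
        visit([start], {start})
    return sorted(found)


# Spectrum from the slide, in the unordered form it is given.
spectrum = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
print(hamiltonian_reconstructions(spectrum))  # ['ATGCAGGTCC']
```

For this spectrum the path is unique, so the original sequence is recovered. In general, finding a Hamiltonian path is NP-complete, which is the usual motivation for the Euler path formulation.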
Sequence Reconstruction - Euler path approach
Vertices: AT GT CG CA GC TG GG
- Vertices: correspond to (k-1)-tuples (7);
- Edges: correspond to k-tuples from the spectrum (8);
- Path: visits all EDGES, corresponding to the sequence.
Two sequences share this spectrum: ATGCGTGGCA and ATGGCGTGCA.
Path for ATGGCGTGCA: ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
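The Euler path formulation can be checked on the same example. In the sketch below (function name assumed; exhaustive backtracking, so suitable for toy spectra only), each k-tuple becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and every path that uses each edge exactly once is enumerated.

```python
from collections import defaultdict


def eulerian_reconstructions(spectrum):
    """Return every sequence spelled by a path that uses each k-mer edge once."""
    edges = [(kmer[:-1], kmer[1:]) for kmer in spectrum]
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per vertex
    for idx, (u, v) in enumerate(edges):
        out_edges[u].append(idx)
        balance[u] += 1
        balance[v] -= 1
    # An Euler path starts at the vertex with one surplus outgoing edge, if any.
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else edges[0][0]

    used = [False] * len(edges)
    found = set()

    def walk(node, sequence, n_used):
        if n_used == len(edges):
            found.add(sequence)
            return
        for idx in out_edges.get(node, ()):
            if not used[idx]:
                used[idx] = True
                nxt = edges[idx][1]
                walk(nxt, sequence + nxt[-1], n_used + 1)
                used[idx] = False

    walk(start, start, 0)
    return sorted(found)


spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(eulerian_reconstructions(spectrum))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both sequences from the slide come out, showing that the spectrum alone does not determine which one was sequenced. Ambiguities of this kind are what extra information such as read pairs or longer reads is used to resolve.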
Assembly Strategy (figure)
- Solexa reads assembler: extends to long reads of 1-2 Kb
- Capillary reads assembler (Phrap/Phusion): genome/chromosome
- Forward-reverse paired reads with a known distance of ~500 bp
Kmer Extension & Repeat Junctions
Handling of Single Base Variations
Fuzzy Kmers: Number of Mismatches between Two Kmers
ACGTAACTAACAGTT
ACGTAACTCACAGTT
ACGTAACT ACAGTT (shared bases; the two kmers differ at one position)
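Counting mismatches between two kmers of equal length is a Hamming distance. The sketch below applies it to the two kmers shown; the function names and the `max_mismatches` threshold are assumptions for illustration, not FuzzyPath parameters.

```python
def count_mismatches(kmer_a, kmer_b):
    """Number of positions at which two equal-length kmers differ."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    return sum(a != b for a, b in zip(kmer_a, kmer_b))


def is_fuzzy_match(kmer_a, kmer_b, max_mismatches=1):
    """Treat two kmers as equivalent if they differ at few enough positions."""
    return count_mismatches(kmer_a, kmer_b) <= max_mismatches


print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))  # 1
print(is_fuzzy_match("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))    # True
```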
Kmer Extension & Repeat Junctions
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference or Sanger reads
(Figure: pileup of other reads, such as 454 or Sanger, at a repeat junction, with the consensus.)
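To make the idea of kmer extension and repeat junctions concrete, here is a minimal sketch. It is not FuzzyPath's algorithm: the function names, the stop labels and the toy reads are assumptions. A seed is extended one base at a time for as long as the reads agree on a single next base, and extension stops where two different continuations are observed, which is the point at which the means listed above (base quality, read pairs, fuzzy kmers, longer reads) would be needed.

```python
from collections import defaultdict


def build_kmer_successors(reads, k):
    """For each k-mer, count the bases that follow it in the reads."""
    successors = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k):
            successors[read[i:i + k]][read[i + k]] += 1
    return successors


def extend(seed, successors, k, max_length=1000):
    """Extend seed to the right until a dead end, junction, loop or length cap."""
    contig = seed
    seen = {seed[-k:]}
    while len(contig) < max_length:
        options = successors.get(contig[-k:])
        if not options:
            return contig, "dead_end"
        if len(options) > 1:
            return contig, "junction"  # reads disagree on the next base
        contig += next(iter(options))
        if contig[-k:] in seen:
            return contig, "loop"
        seen.add(contig[-k:])
    return contig, "max_length"


# Two hypothetical reads sharing the repeat GACCGTA, then diverging.
reads = ["TTGACCGTAAGC", "CAGACCGTATTG"]
successors = build_kmer_successors(reads, k=5)
print(extend("TTGAC", successors, k=5))  # ('TTGACCGTA', 'junction')
```

The extension runs through the shared repeat and halts at its end, where one read continues with A and the other with T.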
Pileup of Solexa and 454 Reads
S. suis P1/7 Solexa/454 Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~55X
- Number of 454 reads: 100,000
- Read coverage of 454: 10X
Assembly features (contig stats):
- Total number of contigs: 73
- Total bases of contigs: 1,999,817 bp
- N50 contig size: 62,508
- Largest contig: 162,190
- Averaged contig size: 27,394
- Contig coverage over the genome: ~99%
- Contig extension errors: 2
- Mis-assembly errors: 3
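The contig statistics reported on these assembly slides can be computed from a list of contig lengths. The sketch below uses the common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases); the contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths is empty")


def average_contig_size(contig_lengths):
    """Total assembled bases divided by the number of contigs."""
    return sum(contig_lengths) / len(contig_lengths)


# Hypothetical contig lengths, for illustration only.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))                            # 70
print(round(average_contig_size(lengths), 1))  # 48.3
```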
Salmonella Senftenberg Solexa Assembly from Pair-End Reads
Solexa reads:
- Number of reads: 6,000,000
- Finished genome size: ~4.8 Mbp
- Read length: 2x37 bp
- Estimated read coverage: ~92.5X
- Insert size: 170/ bp
Assembly features (contig stats):
                              Solexa     454
Total number of contigs:      75         390
Total bases of contigs:       4.80 Mbp   4.77 Mb
N50 contig size:              139,353    25,702
Largest contig:               395,600    62,040
Averaged contig size:         63,969     12,224
Contig coverage on genome:    ~99.8%     99.4%
Contig extension errors:      0
Mis-assembly errors:          0          4
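The estimated read coverage on these slides is total sequenced bases divided by genome size. The sketch below reproduces the ~92.5X figure under the assumption that "number of reads" counts read pairs for the 2x37 bp library; that reading is inferred from the arithmetic and is not stated on the slide. The function name is hypothetical.

```python
def read_coverage(n_reads, read_length, genome_size, paired=False):
    """Average depth: sequenced bases divided by genome size.

    With paired=True, n_reads is taken to be the number of read pairs,
    so each entry contributes two reads of read_length bases.
    """
    ends = 2 if paired else 1
    return n_reads * ends * read_length / genome_size


# Salmonella paired-end library: 6,000,000 pairs of 2x37 bp on ~4.8 Mbp.
print(read_coverage(6_000_000, 37, 4_800_000, paired=True))  # 92.5

# P. falciparum simulation: 25,647,985 reads of 36 bp on 23.0 Mbp.
print(round(read_coverage(25_647_985, 36, 23_000_000), 1))   # 40.1
```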
E. coli strain 042 Assembly
Solexa reads:
- Number of reads: 7,055,348
- Finished genome size: 5.35 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~95X
- Insert size: 170/ bp
Assembly features (contig stats):
- Total number of contigs: 168
- Total bases of contigs: 5.19 Mbp
- N50 contig size: 85,886
- Largest contig: 337,768
- Averaged contig size: 30,886
- Contig coverage over the genome: ~99%
- Contig extension errors: 1
- Mis-assembly errors: 2

Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
- Number of reads: 6,346,317
- Finished genome size: 4.7 Mbp
- Read length: 33 bp
- Estimated read coverage: ~40X
- Shredded reference of SpA: 10X
Assembly features (contig stats):
- Total number of contigs: 66
- Total bases of contigs: 4,615,704 bp
- N50 contig size: 168,793
- Largest contig: 401,700
- Averaged contig size: 69,934
- Contig coverage over the genome: ~98%
- Contig extension errors: 0
- Mis-assembly errors: 2
The Malaria Genome Project
Datasets with Various GC Content
Table columns: library; organism; read length; Mb sequence generated; genome size (Mb); mean coverage (numeric values not shown).
- PCR-free: B. pertussis ST24, 2 x
- PCR-free: E. coli 042, 2 x
- PCR-free: P. falciparum 3D7, 2 x
- PCR-free: B. pertussis ST24, 2 x
- PCR-free: P. falciparum 3D7, 2 x
- PCR-free: E. coli 042, 2 x
- standard-245: P. falciparum 3D7, 2 x
- standard-368: P. falciparum 3D7, 2 x
- standard-851: P. falciparum 3D7, 2 x
- standard-883: P. falciparum clin, 2 x
GC content values listed on the slide: 68.0%, 50.5%, 19.0%, 50.8%, 19.0%, 68.0%, 19.0%
Malaria 3D7 Assemblies
Solexa reads:                 2x36 bp    2x76 bp
Number of reads:              14.0m      9.77m
Finished genome size:         23 Mbp     23 Mbp
Estimated read coverage:      43x        64x
Insert size:                  170 bp     170 bp
Assembly features:
Total number of contigs:      26,
Total bases of contigs:       19.2 Mbp   21.1 Mb
N50 contig size:
Largest contig:
Averaged contig size:
Contig coverage on genome:    ~83.5%     91.7%
Contig extension errors:      ??
Mis-assembly errors:          ??
Acknowledgements:
- Jim Mullikin
- Tony Cox (Illumina, UK)
- Tony Cox (Sanger Institute)
- Adam Spargao, Yong Gu
- Ben Blackburne
- Hannes Ponstingl
- Daniel Turner
- Michael Quail
- Jane Rogers
- Richard Durbin