When the next-generation sequencing becomes the now-generation Lisa Zhang November 6th, 2012
Outline
1. Introduction: Illumina GA/HiSeq; Ion Torrent PGM
2. Data processing: the accuracy of assembly
3. Application: NGS in cancer genomics
Introduction. Platforms shown: 454 GS FLX, HiSeq 2000, SOLiDv4, Sanger 3730xl. SE: single-end reads. PE: paired-end reads.
Which system is more popular? Instrument: $690,000; cost/Mb: $0.07. Instrument: $68,000; cost/Mb: $0.63 (318 chip).
Ion Torrent PGM system: emulsion PCR. Figure labels: mineral oil; DNA template; beads; contain polymerase + dNTP; beads with no products; beads with amplified product; adapter P1; adapter P2.
Ion Torrent PGM system. Ion Torrent PGM finished sequencing an isolate (TY2482) from the E. coli O104:H4 outbreak within 72 hours. Seven runs generated 79 Mb of sequence data.
Illumina GA/HiSeq system. Later, Illumina HiSeq sequenced the same sample to improve accuracy. Runs of three libraries generated 2.1 Gb of sequence data within 2 weeks.
Comparison of alignment between the two systems. a: aligned with TMAP. b: aligned with SOAP2. A Rhodobacter sample with high GC content (66%) and a 4.2 Mb genome was sequenced on Ion Torrent and HiSeq 2000 sequencers.
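The GC content quoted above is the fraction of G and C bases in a sequence. A minimal sketch of that calculation, assuming ambiguous bases such as N are ignored (the function name `gc_content` is illustrative, not from the slides):

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G/C bases among the unambiguous bases A, C, G, T.

    Ambiguous symbols (for example N) are ignored; an input with no
    unambiguous bases yields 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0
```

For example, `gc_content("GGCCAT")` counts four G/C bases out of six.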
Comparison of sequencing quality. Per-base sequence quality of samples, generated by FastQC. The yellow boxes show the base-calling quality scores across all sequencing reads. The blue line indicates the mean quality score. Q20 = 99% accuracy. Q30 = 99.9% accuracy… Panels: Illumina HiSeq 2000; Ion Torrent PGM.
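The Q20 and Q30 figures follow from the Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. A small sketch of the conversion, assuming FASTQ quality strings use the common Phred+33 ASCII encoding (function names are illustrative):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string; offset=33 assumes Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string: str, offset: int = 33) -> float:
    """Arithmetic mean of the per-base quality scores of one read."""
    scores = decode_qualities(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0
```

With these definitions, Q20 maps to an error probability of 0.01 (99% accuracy) and Q30 to 0.001 (99.9% accuracy).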
Comparison of homopolymer error rates. Rates of insertions or deletions in homopolymer tracts, normalized by homopolymer length. Panels: Illumina HiSeq 2000; Ion Torrent PGM.
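To make the homopolymer comparison concrete, the sketch below locates homopolymer tracts in a sequence and tallies, per tract length, how often a read reports a different length than the reference. Grouping by reference tract length is one plausible reading of "normalized by homopolymer length", not necessarily the exact method behind the figure; the function names and the `(reference_length, observed_length)` input format are assumptions.

```python
from collections import defaultdict
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 2) -> list[tuple[int, str, int]]:
    """Return (start, base, length) for every run of one base of at least min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def indel_rate_by_length(observations) -> dict[int, float]:
    """Fraction of observed tracts whose length differs from the reference.

    observations: iterable of (reference_length, observed_length) pairs,
    one pair per read covering a homopolymer tract (hypothetical input).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ref_len, obs_len in observations:
        totals[ref_len] += 1
        if obs_len != ref_len:
            errors[ref_len] += 1
    return {length: errors[length] / totals[length] for length in sorted(totals)}
```

For instance, `homopolymer_runs("ACGGGTTAAAAC", min_len=3)` reports the GGG tract at offset 2 and the AAAA tract at offset 7.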
NGS inspired sequencing of new species. 13 years; $2.7 billion. 5 months; $1.5 million.
Challenges in de novo genome assembly How to assemble? High-quality reads A good assembler
Framework of data processing
1. Filter low-quality reads
2. Filter or trim adapter reads (optional)
3. Filter PCR duplicate reads
4. Remove contaminant reads (mitochondrion, etc.)
5. Split tandem repeat reads (di- or tri-) (optional)
6. Remove (or correct) reads with low-frequency k-mers
7. De novo assembly
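Two of the steps above, filtering PCR duplicates and removing reads with low-frequency k-mers, can be sketched in a few lines. This is a simplified illustration under stated assumptions: duplicates are exact sequence matches, reverse complements are not considered, and a read is dropped outright rather than corrected. The function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq (nothing if seq is shorter than k)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def remove_exact_duplicates(reads: list[str]) -> list[str]:
    """Keep the first copy of each distinct read sequence, preserving order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count k-mer occurrences over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def drop_low_frequency_reads(reads: list[str], k: int, min_count: int) -> list[str]:
    """Drop reads containing any k-mer seen fewer than min_count times.

    Reads shorter than k contain no k-mers and are therefore kept.
    """
    counts = count_kmers(reads, k)
    return [r for r in reads if all(counts[km] >= min_count for km in kmers(r, k))]
```

The idea behind the k-mer step is that a sequencing error creates k-mers that occur rarely across the data set, whereas true genomic k-mers recur roughly in proportion to coverage.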
Effects of error correction. Correcting the reads reduces the number of contigs and scaffolds, increases the contig sizes, and allows the assembler to include more reads. Figure labels: SOAPdenovo bee assembly; y-axis: expected vs. observed error rates; k-mer occurrence.
Assemblers evaluated. Name (launch year): algorithm; features.
ABySS (2009): de Bruijn graph; small memory required.
ALLPATHS-LG (2011): de Bruijn graph; supports jumping libraries.
Bambus2 (2011): repeat detection; designed for polymorphic and metagenomic scaffolding.
CABOG (2008): Overlap-Layout-Consensus; designed for 454 data.
MSR-CA (2011): de Bruijn graph and Overlap-Layout-Consensus; reduces the number of reads by grouping.
SGA (2012): Burrows-Wheeler Transform; small memory required; mix of short and long reads.
SOAPdenovo (2010): de Bruijn graph; designed for short reads; supports large k-mer values.
Velvet (2008): de Bruijn graph; mix of short and long reads.
Illustration of assembly based on de Bruijn graphs. This method of construction ensures that connected nodes have overlaps of K-1 nucleotides. The construction can be repeated to build the graph for a genome of any size. Panels (a) to (e).
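The K-1 overlap property can be seen directly in a small construction. The sketch below treats each k-mer as a node and links k-mers that are adjacent in a read, so every edge joins nodes overlapping by k-1 bases. It is a toy illustration, not the data structure of any particular assembler: it ignores reverse complements, sequencing errors, and coverage, and the walk only checks out-degree. Function names are illustrative.

```python
def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph: dict[str, set[str]] = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            node = read[i:i + k]
            graph.setdefault(node, set())
            if i > 0:
                # The previous k-mer starts one base earlier, so it shares
                # k-1 bases with the current one.
                graph[read[i - 1:i - 1 + k]].add(node)
    return graph


def walk_unambiguous(graph: dict[str, set[str]], start: str) -> str:
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig
```

With the reads `ACGTG`, `CGTGC`, and `GTGCA` and k = 3, the graph is a single path and the walk from `ACG` spells `ACGTGCA`.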
Illustration of mis-assemblies. A rearrangement-style mis-assembly. (a) Three-copy repeat R, with interspersed unique sequences B and C, shown with properly sized and oriented mate-pairs. (b) Mis-assembled repeat shown with mis-oriented and expanded mate-pairs. The mis-assembly is caused by co-assembled reads from different repeat copies.
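The mate-pair signatures described above (mis-oriented or expanded pairs) can be checked mechanically once both mates are placed on the assembly. The sketch below is a simplified classifier under explicit assumptions: the library is expected in forward-reverse orientation (leftmost mate on the + strand, rightmost on the - strand), the insert size is summarized by a mean and standard deviation, and the three-standard-deviation cutoff is arbitrary. The `Mate` type and `classify_pair` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mate:
    start: int   # leftmost coordinate of the aligned read on the assembly
    end: int     # rightmost coordinate of the aligned read
    strand: str  # "+" or "-"


def classify_pair(a: Mate, b: Mate, mean_insert: float, sd_insert: float,
                  n_sd: float = 3.0) -> str:
    """Label a placed pair as 'ok', 'mis-oriented', 'expanded', or 'compressed'."""
    first, second = sorted((a, b), key=lambda m: m.start)
    # Assumed library orientation: forward-reverse, mates pointing inward.
    if not (first.strand == "+" and second.strand == "-"):
        return "mis-oriented"
    span = second.end - first.start
    if span > mean_insert + n_sd * sd_insert:
        return "expanded"
    if span < mean_insert - n_sd * sd_insert:
        return "compressed"
    return "ok"
```

A cluster of pairs labelled as mis-oriented or expanded around a repeat is the kind of evidence the figure uses to flag a mis-assembly; a single odd pair is usually not enough.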
A dot-plot comparison of SOAPdenovo and Velvet scaffolds. The finished reference chromosomes are plotted on the x-axis and the assembly scaffolds on the y-axis. Annotated features: inversion, relocation, insertion, deletion.
Performance of 8 assemblers on R. sphaeroides genomes (~4.60 Mb). The best value for each column is shown in bold. Errors = number of misjoins + indel errors >5 bp. Corrected N50 values were computed after correcting contigs and scaffolds by breaking them at each error. SNPs: single nucleotide polymorphisms. Indels: insertions and deletions. Inv: inversions. Reloc: relocations.
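The corrected N50 described above can be illustrated with a short calculation: break each contig at its error positions, then compute N50 over the resulting pieces. The sketch computes N50 relative to the total length of the sequences passed in; an evaluation may instead use the known genome size as the denominator, so treat that choice as an assumption. Function names and the `(length, error_positions)` input format are illustrative.

```python
def n50(lengths: list[int]) -> int:
    """Largest length L such that pieces of length >= L cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def break_at_errors(length: int, error_positions: list[int]) -> list[int]:
    """Split one contig of the given length at each interior error position."""
    cuts = sorted(p for p in set(error_positions) if 0 < p < length)
    pieces = []
    previous = 0
    for cut in cuts:
        pieces.append(cut - previous)
        previous = cut
    pieces.append(length - previous)
    return pieces


def corrected_n50(contigs: list[tuple[int, list[int]]]) -> int:
    """N50 after breaking every (length, error_positions) contig at its errors."""
    pieces = []
    for length, errors in contigs:
        pieces.extend(break_at_errors(length, errors))
    return n50(pieces)
```

For contig lengths 80, 70, 50, 40, 30, and 20 the N50 is 70; a single error in the middle of the 80-unit contig lowers the corrected N50 to 40, which shows how a few misjoins can inflate an uncorrected N50.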
Summary of assembler performance. Average contig (a) and scaffold (b) sizes, measured by N50 vs. error rates, averaged over all three genomes (S. aureus, R. sphaeroides, and human chromosome 14). In both plots, the best assemblers appear in the upper right. Comparison of insertion and deletion errors among all eight assemblers for human chromosome 14; indel errors >5 bp were counted.
BRCA mutations associated with a risk of breast cancer. Figure labels: Ref; BRCA sequences from normal people; BRCA sequences from patients; mapping; mutation. … …
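The mapping idea in this figure, comparing patient sequences against a reference to find mutations, can be sketched on a pair of already aligned sequences. This is a toy illustration, not a variant caller: it assumes the two sequences come from a pairwise alignment of equal length with `-` marking gaps, it reports each differing column separately (multi-base indels are not merged), and the function name `call_variants` is hypothetical.

```python
def call_variants(ref_aln: str, sample_aln: str) -> list[tuple[int, str, str, str]]:
    """List differences between two aligned sequences.

    Returns (reference_position, kind, ref_base, sample_base) tuples with
    1-based reference positions. For an insertion, the position is that of
    the last reference base before the inserted base.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1  # only reference bases advance the coordinate
        if r == s:
            continue
        if r == "-":
            variants.append((ref_pos, "insertion", "-", s))
        elif s == "-":
            variants.append((ref_pos, "deletion", r, "-"))
        else:
            variants.append((ref_pos, "substitution", r, s))
    return variants
```

For the aligned pair `ACGT-ACG` and `ACCTTA-G`, the function reports a G to C substitution at position 3, a T inserted after position 4, and a deleted C at position 6.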
Distribution of variants in the complete BRCA genes. Variants in exons, introns, and untranslated regions of BRCA1 (a) and BRCA2 (b). The data below show the number of variants detected in BRCA1 and BRCA2.
Accuracy of variant detection in NGS data. Discrepant nucleotide substitutions between NGS and Sanger sequencing. Panels (a) to (d): the variant position is indicated by an arrow, with the corresponding sample, gene, and nucleotide change indicated above each chromatogram.
Conclusion. Next-generation sequencing is becoming the now-generation. It has shifted the field from simple sequencing to genome-wide analyses. Its broad use in sequencing mRNA and non-coding RNA, as well as in profiling DNA methylation, will help reveal the rules of genomic regulation and help diagnose genetic diseases.
Illustration of mis-assemblies. An inversion-style mis-assembly. (a) Two-copy, inverted repeat R, bounding unique sequence B, shown with properly sized and oriented mate-pairs. (b) Mis-assembled repeat shown with mis-oriented mate-pairs.
Identification of indels in BRCA genes by Sanger and NGS. (a) A G deletion in BRCA1; (b) an A insertion in BRCA2; (c) a TG deletion in BRCA2.
Application of NGS in cancer genomics. Analysis of 430 genes mutated across seven cancer genomes with DAVID. Number of mutated genes by GO terms of gene function.