Computational Advances in Next Generation Sequencing Ion Măndoiu (University of Connecticut) Alex Zelikovsky (Georgia State University) April 28, 2011, Boston, MA
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions
Next (= 2nd) Generation Sequencing
Roche/454 FLX Titanium: ~1 million reads/run, bp avg. length
Illumina HiSeq 2000: up to 6 billion PE reads/run, bp read length
SOLiD: billion PE reads/run, 35-50bp read length
Cost of Whole Genome Sequencing C.Venter J. Watson NA18507
Outline Next-gen sequencing technologies – Illumina – SOLiD – 454 Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions
Illumina Workflow – Library Preparation Genomic DNA mRNA
Illumina Workflow – Cluster Generation
Illumina Workflow – Sequencing by Synthesis
Outline Next-gen sequencing technologies – Illumina – SOLiD – 454 Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions
SOLiD – emulsion PCR Emulsion PCR used to perform single molecule amplification of pooled library onto magnetic beads
SOLiD – sequencing by ligation
Outline Next-gen sequencing technologies – Illumina – SOLiD – 454 Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions
454 – pyrosequencing
Emulsion PCR
Single nucleotide addition
– Natural nucleotides
– DNA polymerase pauses until the complementary nucleotide is dispensed
– Nucleotide incorporation triggers an enzymatic reaction that results in emission of light
454 sequencing errors Fixed number of incorporated bases vs. light intensity value Incorrect resolution of homopolymers => – over-calls (insertions) 65-75% of errors – under-calls (deletions) 20-30% of errors
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions
Re-sequencing De novo sequencing RNA-Seq Non-coding RNAs Structural variation ChIP-Seq Methyl-Seq Metagenomics Paleogenomics Viral quasispecies … many more biological measurements “reduced” to NGS sequencing A transformative technology
De novo assembly pipeline (SOLiD™ de novo accessory tools 2.0)
Input: fragment reads, mate-paired reads, or paired-end reads
– resampling (optional): sample reads, estimate params
– SAET (optional): error correct reads
– preprocessor: prepare input for Velvet and ASiD
– Velvet: generate repeat graph
– ASiD: fill gaps, translate to base-space
– analysis (optional): analyze contigs/scaffolds
Output: alignment of reads, contigs/scaffolds, analysis
Adding pre-assembly error correction with SAET reduces the error rate from 3-5% to below 1%, which enables de novo assembly and increases the N50 contig size by a factor of 3
NGS error correction
NGS reads contain more errors than standard Sanger reads
Most common approaches
‒ Use of a reference sequence and alignments
‒ Clustering (PyroNoise, ShoRAH, …): define the distances between reads, cluster the reads according to this distance, calculate the consensus of each cluster. Main disadvantage: loss of rare variants
‒ Spectral (k-mer) approach: alignment-free (Pevzner, Brinza, EDAR 2010)
K-mer approach: single read
1. Consider the set of k-mers of the read. For r = ATCCGAT and k = 4, the k-mers are {ATCC, TCCG, CCGA, CGAT}
2. Calculate the frequency of each k-mer in the whole data set
K-mer approach: removing errors
3. Determine the threshold between erroneous k-mers and correct ones
4. For each read, cluster k-mers by their frequencies. Each cluster determines an error region
5. Delete error regions
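The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SAET or EDAR implementation: the function names, the fixed `threshold` argument, and the simplification of step 4 to "a run of consecutive low-frequency k-mers forms one error region" are all assumptions made for the example.

```python
from collections import Counter


def kmers(read, k):
    """Return the k-mers of a read, left to right (step 1)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts(reads, k):
    """Count every k-mer across the whole data set (step 2)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def error_regions(read, counts, k, threshold):
    """Return half-open (start, end) read intervals covered by runs of weak k-mers.

    A k-mer is 'weak' when its data-set frequency is below `threshold`
    (step 3). Grouping consecutive weak k-mers is a simplified stand-in
    for the frequency clustering of step 4 (an assumption of this sketch).
    """
    regions = []
    run_start = None
    for i, km in enumerate(kmers(read, k)):
        weak = counts[km] < threshold
        if weak and run_start is None:
            run_start = i
        elif not weak and run_start is not None:
            # The last weak k-mer started at i - 1 and spans k bases.
            regions.append((run_start, i - 1 + k))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(read)))
    return regions


# Five copies of the slide's read plus one copy with a G->T miscall.
reads = ["ATCCGAT"] * 5 + ["ATCCTAT"]
counts = kmer_counts(reads, 4)
bad = error_regions("ATCCTAT", counts, 4, threshold=2)
```

Step 5 (deleting the regions) would then trim or split each read at the returned intervals; how aggressively to do so is a design choice this sketch leaves open.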
SOLiD™ Accuracy Enhancement Tool (SAET)
Combines information from multiple readings of the same DNA region and corrects miscalls in the reads
Aligns reads de novo, without using a reference
Applies only statistically sound corrections
Increases accuracy by reducing the raw error rate by up to 5 times, e.g., from ~5% to ~1%
Increases throughput by increasing mappability by up to 30%, e.g., from 50% to 65%
Figure: original vs. SAET-corrected 50bp reads from E. coli (gray: mismatch to reference)
De novo assembly pipeline (SOLiD™ de novo accessory tools 2.0)
Same pipeline: resampling (optional), SAET (optional), preprocessor, Velvet, ASiD, analysis (optional)
75% of gaps are filled by the ASiD step, giving a 6-fold increase in N50 contig size
ASiD: Assembly assistant for SOLiD™ applied to bacterial de novo assembly
Assemble scaffolds of the entire genome using the Velvet assembler
Map all reads to scaffolds
For each gap between two contigs, collect hanging mates
Error correct and assemble each subset of reads using a low coverage cut-off
Replace the gap with the uniquely assembled sequence
Figure: scaffold coverage
ASiD Assembly
Figure: contig A, missing assembly, contig B, with read fragments placed across the gap. Color legend: green, adjacent contigs; orange, missing assembly; blue, fragments included into assembly; red, hanging fragments
E. coli dataset
E. coli (4.6Mb), 300x coverage, 2x50, insert ~1.3Kb
– N50: 5.2Kb (contigs), 200Kb (scaffolds), 30Kb (assisted assembly contigs)
– Mean contig length: 3Kb (contigs), 15Kb (scaffolds), 12Kb (assisted assembly contigs)
– Max contig length: 23Kb (contigs), 440Kb (scaffolds), 170Kb (assisted assembly contigs)
– Number contigs > 100 nt
– Sum contig length: 4.45M, 4.47M
– Percent genome covered: 97.24%
– Average identity: 99.8%
Sugarcane BACs dataset
24 sugarcane BACs (~130kb)
Single-BAC assemblies have an N50 of ~5Kbp
Assembled contigs have high quality: zero misassemblies
~95% of genes lie entirely inside contigs
Adding 0.5-1x of 454 reads would result in a single contig per BAC
Sequencing with a longer insert size (>600bp) would result in a single contig per BAC
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection – Linkage disequilibrium genotype calling – SNP calling and genotyping from transcriptome sequencing data Transcriptome analysis Reconstruction of viral quasispecies Conclusions
Motivation
Sequencing provides single-base resolution of genetic variation (SNPs, CNVs, genome rearrangements, …)
However, determination of both alleles at variable loci is limited by coverage depth due to the random nature of shotgun sequencing
For the Venter and Watson genomes (~7.5x average coverage), comparison with SNP genotyping chips yields only ~75% accuracy for sequencing-based calls of heterozygous SNPs [Levy et al. 07, Wheeler et al. 08]
Allele coverage for heterozygous SNPs (Watson 5.85x avg. coverage)
Allele coverage for heterozygous SNPs (Watson 2.93x avg. coverage)
Allele coverage for heterozygous SNPs (Watson 1.46x avg. coverage)
Allele coverage for heterozygous SNPs (Watson 0.73x avg. coverage)
Allele coverage for heterozygous SNPs (Watson 0.37x avg. coverage)
Prior methods
Most work is devoted to de novo variation discovery
Coverage-based methods
– [Levy et al. 07] [Wheeler et al. 08]: min coverage + binomial test for calling heterozygotes
– [Wendl&Wilson 08]: arbitrary minimum allele coverage k; estimate that 21x coverage is required, based on idealized theory that “neglects any heuristic inputs”
Bayesian methods
– Maq [Li, Ruan, Durbin 08], SOAPsnp [Li et al. 09]: take into account some additional info such as base quality scores and heterozygosity rate priors
Do heuristic inputs help?
In [Duitama et al., APBC 2011] we introduced methods incorporating additional sources of information for genotyping known SNPs:
– Allele/genotype frequency
– Linkage disequilibrium (LD)
Experimental results show significantly improved genotyping accuracy for low-coverage sequencing
– Similar methods published recently by the 1000 Genomes analysis group
Basic notations
Biallelic SNPs: 0 = major allele, 1 = minor allele
SNP genotypes: 0/2 = homozygous major/minor, 1 = heterozygous
Let r_i denote the set of mapped reads covering SNP locus i, and c_i = |r_i|
For a read r in r_i, r(i) denotes the allele observed at locus i
If q_{r(i)} is the phred quality score of r(i), the probability that r(i) is incorrect is given by ε_{r(i)} = 10^{-q_{r(i)}/10}
Incorporating base call uncertainty Probability of observing read set r i conditional on G i :
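As a hedged illustration of how base call uncertainty can enter the read set probability, the sketch below computes P(r_i | G_i) for a biallelic SNP. The phred conversion is the standard definition; the likelihood form is an assumption of this example (a miscalled base is taken to show the other allele, and each read from a heterozygous locus samples either allele with probability 1/2), and the function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a phred quality score into an error probability."""
    return 10.0 ** (-q / 10.0)


def read_set_likelihood(alleles, quals, genotype):
    """P(r_i | G_i = genotype) for one biallelic SNP locus.

    alleles  -- observed alleles r(i), each 0 (major) or 1 (minor)
    quals    -- phred scores q_{r(i)}, one per read
    genotype -- 0 (homozygous major), 1 (heterozygous), 2 (homozygous minor)
    """
    if genotype == 1:
        # Assumed model: (1 - e)/2 + e/2 = 1/2 for every read,
        # regardless of its quality score.
        return 0.5 ** len(alleles)
    true_allele = 0 if genotype == 0 else 1
    likelihood = 1.0
    for allele, q in zip(alleles, quals):
        e = phred_to_error(q)
        likelihood *= (1.0 - e) if allele == true_allele else e
    return likelihood


# Three reads at one locus: two show the major allele, one the minor.
alleles, quals = [0, 0, 1], [20, 20, 20]
likelihoods = [read_set_likelihood(alleles, quals, g) for g in (0, 1, 2)]
```

With only three reads the heterozygous genotype already has the largest likelihood here, which hints at why low coverage makes heterozygous calls sensitive to priors.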
Haplotype structure in human populations
HMM model of haplotype frequencies
Figure: hidden chain F_1, F_2, …, F_n emitting H_1, H_2, …, H_n
F_i = founder haplotype at locus i, H_i = observed allele at locus i
– For a given haplotype h, P(H=h|M) can be computed in O(nK^2) using the forward algorithm
Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 08, Kimmel&Shamir 05, Scheet&Stephens 06]
HF-HMM for multilocus genotype inference
Figure: two haplotype chains, (F_1, …, F_n) emitting (H_1, …, H_n) and (F'_1, …, F'_n) emitting (H'_1, …, H'_n); genotype G_i depends on H_i and H'_i, and reads R_{i,1}, …, R_{i,c_i} depend on G_i
Model training
P(f_1), P(f'_1), P(f_{i+1}|f_i), P(f'_{i+1}|f'_i), P(h_i|f_i), P(h'_i|f'_i) are trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
P(g_i|h_i,h'_i) is set to 1 if h_i + h'_i = g_i and to 0 otherwise
The conditional read set probabilities are given by the single-SNP model
Multilocus genotyping problem
GIVEN:
– Shotgun read sets r = (r_1, r_2, …, r_n)
– Quality scores
– Trained HMM models representing LD in populations of origin for mother/father
FIND:
– Multilocus genotype g* = (g*_1, g*_2, …, g*_n) with maximum posterior probability, i.e., g* = argmax_g P(g | r)
Theorem: max_g P(g | r) cannot be approximated within unless ZPP=NP
Posterior decoding algorithm
1. For each i = 1..n, compute g*_i = argmax_{g_i} P(g_i | r)
2. Return g* = (g*_1, …, g*_n)
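Posterior decoding picks, at each position, the state with the largest marginal posterior, which a forward-backward pass provides. The sketch below shows this on a generic single-chain HMM with discrete emissions; it is not the HF-HMM of the talk (which couples two haplotype chains through the genotype), and the function name and parameter layout are assumptions of the example. It also omits the scaling needed to avoid underflow on long sequences.

```python
def posterior_decode(init, trans, emit, obs):
    """Return, per position, the hidden state with maximum posterior probability.

    init[s]     -- P(state_1 = s)
    trans[s][t] -- P(state_{i+1} = t | state_i = s)
    emit[s][o]  -- P(observation = o | state = s)
    obs         -- list of observation indices
    """
    n, k = len(obs), len(init)
    fwd = [[0.0] * k for _ in range(n)]
    bwd = [[1.0] * k for _ in range(n)]

    # Forward pass: fwd[i][s] = P(obs[0..i], state_i = s).
    for s in range(k):
        fwd[0][s] = init[s] * emit[s][obs[0]]
    for i in range(1, n):
        for t in range(k):
            incoming = sum(fwd[i - 1][s] * trans[s][t] for s in range(k))
            fwd[i][t] = emit[t][obs[i]] * incoming

    # Backward pass: bwd[i][s] = P(obs[i+1..n-1] | state_i = s).
    for i in range(n - 2, -1, -1):
        for s in range(k):
            bwd[i][s] = sum(
                trans[s][t] * emit[t][obs[i + 1]] * bwd[i + 1][t]
                for t in range(k)
            )

    # The posterior of state s at position i is proportional to fwd * bwd.
    return [
        max(range(k), key=lambda s: fwd[i][s] * bwd[i][s])
        for i in range(n)
    ]


# Toy two-state model with sticky transitions and informative emissions.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
decoded = posterior_decode(init, trans, emit, [0, 0, 1, 1])
```

Note that the per-position maxima need not form the single most probable joint sequence; posterior decoding trades that property for maximizing the expected number of correct positions.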
Forward-backward computation of posterior probabilities
Figure: the HF-HMM around locus i, with founder haplotypes f_i and f'_i, alleles h_i and h'_i, genotype g_i, and reads r_{1,1}, …, r_{1,c_1} through R_{n,1}, …, R_{n,c_n}
Runtime
Forward probabilities can be computed with direct recurrences; reusing common terms reduces the runtime to O(m + nK^3)
Pipeline for LD-Based Genotype Calling
Figure: the inputs are Hapmap genotypes or haplotypes, the reference genome sequence, and read sequences with quality scores; reads are mapped to the reference, and the mapped reads are used to produce SNP genotype calls (one line per rs ID, with the two called alleles and a numeric score)
Datasets
Dataset | Test SNPs | Raw reads | Raw sequence | Mapped reads | Avg. mapped SNP cov.
Watson | K (Affy 500k) | 74.2M, 265bp avg | 19.7Gb | 49.8M (67%) | 5.85x
NA18507 Illumina | 2.85M (Hapmap) | 525M, 36bp, paired | 18.9Gb | 397M (78%) | 6.10x
NA18507 SOLiD | 2.85M (Hapmap) | 2.45G, 24-44bp, single | 75Gb | 900M (37%) | 9.85x
Mapping Procedure
454 reads mapped using the NUCMER tool of the MUMmer package [Kurtz et al 04] with default parameters
– Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
– Reads meeting the above conditions at multiple genome positions (likely coming from genomic repeats) were discarded
Illumina reads mapped using MAQ [Li et al 08] with default parameters
– For reads mapped at multiple positions, MAQ returns the best position (breaking ties arbitrarily) together with a mapping confidence
– We filtered bad alignments and discarded paired-end reads that are not mapped in pairs using the “submap -p” command
SOLiD reads mapped using BioScope
– Alignments provided by the authors of [McKernan et al. 99]
Comparison of genotyping methods (NA18507 Illumina, Homozygotes)
Comparison of genotyping methods (NA18507 Illumina, Heterozygous)
HMM posterior accuracy on the 3 datasets
Distribution of allele coverage ratios for heterozygous SNPs
HMM posterior accuracy at varying call rates (Watson 454 reads, 5.85x avg.)
Summary
The posterior decoding algorithm has a highly scalable running time and yields significant improvements in genotype calling accuracy compared to previous methods
– Improved accuracy makes low-coverage sequencing competitive in cost with microarrays for next-generation GWAS
– Open source code available at
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection – Linkage disequilibrium genotype calling – SNP calling and genotyping from transcriptome sequencing data Transcriptome analysis Reconstruction of viral quasispecies Conclusions
Motivation RNA-Seq is the method of choice for studying functional effects of genetic variability – Mature library preparation & sequencing protocols – Much less expensive than genome sequencing Can sequence variants be discovered reliably from RNA-Seq data? – SNVQ: novel Bayesian model for SNV discovery and genotyping from RNA-Seq data [Duitama et al., ICCABS 2011 ] – Particularly appropriate when interest is in expressed mutations (cancer immunotherapy)
SNP Calling from Genomic DNA Reads
Figure: read sequences & quality scores (Illumina reads such as HWI-EAS299_2:2:1:1536:631) go through read mapping against the reference genome sequence (Mus musculus chromosome 19 genomic contig, strain C57BL/6J), followed by SNP calling from the allele counts at each locus
Mapping mRNA Reads
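Spliced read alignment
C. Trapnell, L. Pachter, and S.L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–1111
Mapping and Merging Strategy
Figure: mRNA reads are mapped both to CCDS transcripts (CCDS mapping) and to the genome (genome mapping); the CCDS mapped reads and genome mapped reads are then combined by read merging into the final set of mapped reads
Read Merging
Genome | CCDS | Agree? | Hard Merge | Soft Merge
Unique | Unique | Yes | Keep | Keep
Unique | Unique | No | Throw | Throw
Unique | Multiple | No | Throw | Keep
Unique | Not mapped | No | Keep | Keep
Multiple | Unique | No | Throw | Keep
Multiple | Multiple | No | Throw | Throw
Multiple | Not mapped | No | Throw | Throw
Not mapped | Unique | No | Keep | Keep
Not mapped | Multiple | No | Throw | Throw
Not mapped | Not mapped | Yes | Throw | Throw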
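The merging rules on this slide can be written as a small decision function. This is a sketch under stated assumptions: the rule set is read off the slide's table on the assumption that cells collapsed in the extracted text repeat their neighbor (for example, "Unique, Unique, Yes" keeps under both merges), and the function name and status strings are hypothetical.

```python
def merge_decision(genome, ccds, agree, mode="hard"):
    """Decide whether a read is kept after merging genome and CCDS alignments.

    genome, ccds -- alignment status: "unique", "multiple", or "none"
    agree        -- True when two unique alignments point to the same locus
    mode         -- "hard" or "soft" merge
    Returns "keep" or "throw".
    """
    statuses = ("unique", "multiple", "none")
    if genome not in statuses or ccds not in statuses:
        raise ValueError("status must be 'unique', 'multiple' or 'none'")
    if mode not in ("hard", "soft"):
        raise ValueError("mode must be 'hard' or 'soft'")

    # Both mappings unique: keep only when they agree.
    if genome == "unique" and ccds == "unique":
        return "keep" if agree else "throw"
    # Unique in one mapping, unmapped in the other: keep under both merges.
    if (genome, ccds) in (("unique", "none"), ("none", "unique")):
        return "keep"
    # Unique in one mapping, multiple in the other: soft merge only.
    if mode == "soft" and (genome, ccds) in (
        ("unique", "multiple"),
        ("multiple", "unique"),
    ):
        return "keep"
    return "throw"


decision = merge_decision("unique", "multiple", agree=False, mode="soft")
```

Under these rules the soft merge differs from the hard merge only for reads that are unique in one mapping and ambiguous in the other.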
SNV Detection and Genotyping
Figure: reads aligned to the reference around locus i
Notation
– R_i: reads covering locus i
– r(i): base call of read r at locus i
– ε_{r(i)}: probability of error reading base call r(i)
– G_i: genotype at locus i
SNV Detection and Genotyping Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one
Current Models
Maq:
– Keep just the alleles with the two largest counts
– Pr(R_i | G_i = H_i H_i) is the probability of observing k alleles r(i) different from H_i
– Pr(R_i | G_i = H_i H'_i) is approximated as a binomial with p = 0.5
SOAPsnp:
– Pr(r_i | G_i = H_i H'_i) is the average of Pr(r_i | H_i) and Pr(r_i | H'_i)
– A rank test on the quality scores of the allele calls is used to confirm heterozygosity
SNVQ Model Calculate conditional probabilities by multiplying contributions of individual reads
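To make "multiplying contributions of individual reads" and the Bayes-rule step concrete, here is a minimal sketch. The per-read term is an assumption of this example rather than a formula shown on the slide: a read contributes the average of its probabilities under the two alleles of the genotype, where a base call matches an allele with probability 1 - ε and otherwise has probability ε/3. The uniform prior and all names are likewise illustrative.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def base_prob(call, err, allele):
    """P(base call | true allele), spreading the error evenly over 3 other bases."""
    return 1.0 - err if call == allele else err / 3.0


def genotype_posteriors(calls, errs, priors=None):
    """Posterior probability of each unordered genotype at one locus.

    calls  -- base calls r(i) of the reads covering the locus
    errs   -- error probabilities, one per call
    priors -- optional dict genotype -> prior; uniform if omitted (assumption)
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}

    joint = {}
    for g in genotypes:
        likelihood = 1.0
        for call, err in zip(calls, errs):
            # Each read is drawn from either allele with probability 1/2.
            likelihood *= 0.5 * (
                base_prob(call, err, g[0]) + base_prob(call, err, g[1])
            )
        joint[g] = likelihood * priors[g]

    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}


# Eight reads, evenly split between A and G, all with 1% error.
post = genotype_posteriors("AAAAGGGG", [0.01] * 8)
best = max(post, key=post.get)
```

In practice the priors would encode the expected heterozygosity rate rather than being uniform, and the genotype with the largest posterior is reported.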
Experimental Setup 113 million 32bp Illumina mRNA reads generated from blood cell tissue of Hapmap individual NA12878 (NCBI SRA database accession numbers SRX and SRX000566) – We tested genotype calling using as gold standard 3.4 million SNPs with known genotypes for NA12878 available in the database of the Hapmap project – True positive: called variant for which Hapmap genotype coincides – False positive: called variant for which Hapmap genotype does not coincide
Comparison of Mapping Strategies
Comparison of Variant Calling Strategies
Data Filtering
Allow just x reads per start locus to eliminate PCR amplification artifacts
[Chepelev et al. 2010] algorithm:
– For each locus, group the reads starting there into those with 0, 1 and 2 mismatches
– Choose at random one read from each group
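The grouping-and-sampling filter described above might look as follows. This is a sketch, not the published implementation: the input representation (tuples of start locus, mismatch count, read id), the seeded random generator, and the choice to drop reads with more than 2 mismatches are assumptions of the example.

```python
import random
from collections import defaultdict


def filter_pcr_duplicates(reads, seed=0):
    """Keep one randomly chosen read per (start locus, mismatch count) group.

    reads -- iterable of (start_locus, mismatches, read_id) tuples
    Reads with more than 2 mismatches fall outside the three groups
    and are dropped (an assumption of this sketch).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for start, mismatches, read_id in reads:
        if mismatches in (0, 1, 2):
            groups[(start, mismatches)].append(read_id)
    # At most three reads survive per start locus, one per mismatch class.
    return sorted(rng.choice(ids) for ids in groups.values())


reads = [
    (100, 0, "r1"), (100, 0, "r2"), (100, 0, "r3"),  # likely PCR duplicates
    (100, 1, "r4"),
    (250, 0, "r5"), (250, 2, "r6"), (250, 2, "r7"),
    (300, 3, "r8"),                                   # too many mismatches
]
kept = filter_pcr_duplicates(reads)
```

Keeping the mismatch classes separate lets a read carrying a true variant survive even when many perfect-match duplicates start at the same locus.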
Comparison of Data Filtering Strategies
Accuracy per RPKM bins
Summary Simple strategy to map mRNA reads using both the reference genome and the CCDS database and new Bayesian model for SNV detection and genotyping Experiments on publicly available datasets show that SNVQ outperforms widely used SNV detection methods
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis – Estimation of expression levels from RNA-Seq reads – Estimation of expression levels from DGE reads Reconstruction of viral quasispecies Conclusions
RNA-Seq protocol
Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
Downstream analyses: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID)
Figure: gene with exons A, B, C, D, E; labels ABC, AC, DE
Alternative splicing [Griffith and Marra 07]
Challenges to accurate estimation of gene expression levels
Read ambiguity (multireads)
What is the gene length?
Figure: gene with exons A, B, C, D, E
Previous approaches to GE
Ignore multireads
[Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
[Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
Read Ambiguity in IE
Figure: exons A, B, C, D, E and isoform AC
Previous approaches to IE
[Jiang&Wong 09]
– Poisson model + importance sampling, single reads
[Richard et al. 10]
– EM algorithm based on Poisson model, single reads in exons
[Li et al. 10]
– EM algorithm, single reads
[Feng et al. 10]
– Convex quadratic program, pairs used only for ID
[Trapnell et al. 10]
– Extends Jiang’s model to paired reads
– Fragment length distribution
IsoEM algorithm [Nicolae et al., WABI 2010]
Unified probabilistic model and Expectation-Maximization algorithm (IsoEM) for IE considering
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
– Repeat and hexamer bias correction
Read-isoform compatibility
Fragment length distribution: paired reads
Figure: isoforms ABC and AC; labels i, j, F_a(i), F_a(j)
Fragment length distribution: single reads
Figure: isoforms ABC and AC; labels i, j, F_a(i), F_a(j)
IsoEM pseudocode
E-step
M-step
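Since the E-step and M-step bodies did not survive on this slide, the following is a generic EM sketch of the same flavor, not the IsoEM pseudocode itself. It assumes each read class carries a count and a weight for every compatible isoform, and it estimates the fraction of reads generated by each isoform; IsoEM's fragment-length, strand, and quality terms would enter through the weights, and the normalization by isoform length is left out. All names are hypothetical.

```python
def em_frequencies(read_classes, n_isoforms, iterations=100):
    """Estimate the fraction of reads originating from each isoform.

    read_classes -- list of (count, {isoform_index: weight}) pairs, where
                    weight is the probability of the read given the isoform
    n_isoforms   -- number of isoforms
    """
    f = [1.0 / n_isoforms] * n_isoforms
    for _ in range(iterations):
        # E-step: split each read class among its compatible isoforms
        # in proportion to current frequency times weight.
        n = [0.0] * n_isoforms
        for count, weights in read_classes:
            total = sum(f[i] * w for i, w in weights.items())
            if total == 0.0:
                continue
            for i, w in weights.items():
                n[i] += count * f[i] * w / total
        # M-step: renormalize the expected counts into frequencies.
        s = sum(n)
        f = [x / s for x in n]
    return f


# 30 reads unique to isoform 0, 10 unique to isoform 1, 40 ambiguous.
classes = [
    (30, {0: 1.0}),
    (10, {1: 1.0}),
    (40, {0: 1.0, 1: 1.0}),
]
freqs = em_frequencies(classes, 2)
```

On this toy input the ambiguous reads end up split 3:1, the same ratio as the unique reads, so the estimate settles at 0.75 and 0.25.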
Implementation details
Collapse identical reads into read classes
Figure: isoforms i1 to i6; reads compatible with (i1,i2), (i3,i4), (i3,i5), (i3,i4); LCA(i3,i4)
Implementation details
Run EM on connected components, in parallel
Figure: isoforms i1 to i6 grouped into connected components
Simulation setup Human genome UCSC known isoforms GNFAtlas2 gene expression levels – Uniform/geometric expression of gene isoforms Normally distributed fragment lengths – Mean 250, std. dev. 25
Accuracy measures
Error Fraction (EF_t)
– Percentage of isoforms (or genes) with relative error larger than a given threshold t
Median Percent Error (MPE)
– Threshold t for which EF is 50%
r^2
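The two error measures can be computed directly from paired estimated and true expression levels. A minimal sketch follows; the function names are illustrative, and skipping entries with zero true expression (where relative error is undefined) is an assumption of the example.

```python
from statistics import median


def relative_errors(estimated, true):
    """Relative error per isoform (or gene), skipping zero true values."""
    return [abs(e - t) / t for e, t in zip(estimated, true) if t > 0]


def error_fraction(estimated, true, threshold):
    """EF_t: percentage of entries with relative error above `threshold`."""
    errs = relative_errors(estimated, true)
    return 100.0 * sum(1 for x in errs if x > threshold) / len(errs)


def median_percent_error(estimated, true):
    """MPE: the median relative error, in percent (the t where EF_t is 50%)."""
    return 100.0 * median(relative_errors(estimated, true))


true_levels = [10.0, 20.0, 40.0, 100.0]
estimates = [11.0, 20.0, 30.0, 150.0]
ef15 = error_fraction(estimates, true_levels, 0.15)
mpe = median_percent_error(estimates, true_levels)
```

Here the relative errors are 0.1, 0, 0.25 and 0.5, so half of the entries exceed a 15% threshold and the median sits between 10% and 25%.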
Error fraction curves - isoforms 30M single reads of length 25 (simulated)
Error fraction curves - genes 30M single reads of length 25 (simulated)
MPE and EF 15 by gene expression level 30M single reads of length 25
MAQC data
RNA samples: UHRR, HBRR
6 libraries, 47-92M 35bp reads each [Bullard et al. 10]
– Bases called using both auto and phi X calibration for 2 libraries
qPCR
– Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]
r 2 comparison for MAQC samples
MPE comparison for MAQC samples
Read length effect on IE MPE Fixed sequencing throughput (750Mb) Single Reads Paired Reads
Read length effect on IE r 2 Fixed sequencing throughput (750Mb)
Effect of pairs & strand information 75bp reads
Runtime scalability
Scalability experiments conducted on a Dell PowerEdge R900
– Four 6-core Xeon E7450 processors at 2.4GHz, 128GB of internal memory
Summary Efficient EM algorithm for estimating isoform/gene expression levels – Integrates fragment length distribution, base qualities, pair and strand info – Java implementation available at
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis – Estimation of expression levels from RNA-Seq reads – Estimation of expression levels from DGE reads Reconstruction of viral quasispecies Conclusions
DGE/SAGE-Seq protocol
Cleave with anchoring enzyme (AE), recognition site CATG
Attach primer for tagging enzyme (TE), recognition site TCCRAC
Cleave with tagging enzyme
Map tags
Gene Expression (GE)
Figure: polyadenylated transcript (AAAAA) with exons A, B, C, D, E, showing the AE site, the TE primer, and the resulting tag
Inference algorithms for DGE data
Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10]
Heuristic rescue of some ambiguous tags [Wu et al. 10]
DGE-EM algorithm [Nicolae & Mandoiu, ISBRA 2011]
– Uses all tags, including all ambiguous ones
– Uses quality scores
– Takes into account partial digest and gene isoforms
Tag formation probability
Tag-isoform compatibility
DGE-EM algorithm
assign random values to all f(i)
while not converged
    E-step:
        init all n(i,j) to 0
        for each tag t
            for (i,j,w) in t
    M-step:
        for each isoform i
MAQC data
DGE
– 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09]
– Anchoring enzyme DpnII (GATC)
RNA-Seq
– 6 libraries, 47-92M 35bp reads each [Bullard et al. 10]
qPCR
– Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]
DGE-EM vs. Uniq on HBRR Library 4
DGE vs. RNA-Seq
Synthetic data 1-30M tags, lengths 14-26bp UCSC hg19 genome and known isoforms Simulated expression levels – Gene expression for 5 tissues from the GNFAtlas2 – Geometric expression for the isoforms of each gene Anchoring enzymes from REBASE – DpnII (GATC) [Asmann et al. 09] – NlaIII (CATG) [Wu et al. 10] – CviJI (RGCY, R=G or A, Y=C or T)
MPE for 30M 21bp tags RNA-Seq: 8.3 MPE
Anchoring enzyme statistics
Summary
New DGE-EM algorithm
– Improves accuracy over previous methods by using ambiguous tags and considering isoforms and partial digestion
– Source code freely available at
RNA-Seq and DGE based estimates have comparable cost-normalized accuracy on MAQC data
– When using the best inference algorithm for each type of data
Simulations suggest possible DGE protocol improvements
– Enzymes with degenerate recognition sites (e.g. CviJI)
– Optimizing cutting probability
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies – Quasispecies assembly problem – VISPA: Viral Spectrum Assembly Tool Conclusions
Viral Quasispecies
RNA viruses (HIV, HCV)
– Many replication mistakes
– Quasispecies (qsps) = co-existing closely related variants
Variants differ in
– virulence
– ability to escape the immune system
– resistance to antiviral therapies
– tissue tropism
How do qsps contribute to viral persistence and evolution?
We need software that assembles reads into multiple genomes!
Quasispecies Spectrum Reconstruction (QSR) Problem
Given
– pyrosequencing reads from a quasispecies population of unknown size and distribution
Reconstruct the quasispecies spectrum
– sequences
– frequencies
State-of-the-Art Tools
ShoRAH (O. Zagordi et al)
– probabilistic clustering
– #clusters: Dirichlet process mixture
Amplicon-based (Prosperi et al 2011)
– determined amplicon partition
– measure of population diversity
ViSpA (Astrovskaya et al 2011)
– Max bandwidth path in weighted graphs
– Accounting for typing error & mutation rate
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies – Quasispecies assembly problem – VISPA: Viral Spectrum Assembly Tool Conclusions
Viral Spectrum Assembler (ViSpA) Flow
Alignment Reference sequence is available Multiple coverage No repeats => unique alignment Alignment score – Minimizes Hamming distances – Penalizes indels more than mismatches Deletions Insertions
Preprocessing of Aligned Reads
1. Deletions in reads: D
2. Insertions into reference: I
3. Error correction
– Replace deletions confirmed by a single read with either the allele value that is present in all other reads, or N
– Remove insertions confirmed by a single read
Read Error Correction
Distinguish rare mutations from genotyping errors
ViSpA: replace unique outliers
ShoRAH (Zagordi et al, 2010)
– Probabilistic clustering
– 3 overlapping windows
EDAR (Zhao et al, 2010)
– Count of k-mers
– Works for qsps
Read Graph: Vertices
Subread = read completely contained in some other read with ≤ n mismatches
Superread = not a subread => a vertex in the read graph
Figure: example reads ACTGGTCCCTCCTGAGTGT, GGTCCCTCCT, TGGTCACTCGTGAG, ACCTCATCGAAGCGGCGTCCT, illustrating a subread with n mismatches and a superread
Read Graph: Edges Several paths may represent the same sequence. Edge b/w two vertices if there is an overlap between superreads and they agree on their overlap with ≤ m mismatches Transitive reduction
Edge Cost
Cost measures the uncertainty that two superreads belong to the same quasispecies
Overhang Δ is the shift in start positions of two overlapping superreads
The cost is a function of Δ, of the number j of mismatches in the overlap o, and of the 454 error rate ε
Choose the most probable source-sink path through each vertex
Path to Sequence
The s-t max bandwidth path per vertex (maximizing probability)
1. Build a coarse sequence out of the path’s superreads:
– For each position: the >70% majority base if it exists, otherwise N
2. Replace the coarse sequence with the weighted consensus obtained on all reads
3. Select unique sequences out of the constructed sequences. Repeatedly constructed sequences = evidence of a real qsps sequence
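A maximum-bandwidth (widest) path maximizes the smallest edge weight along the path, and can be found with a Dijkstra-style search. The sketch below shows a plain source-to-sink search on a small weighted graph. It is an illustration under assumptions, not the ViSpA code: the graph encoding and function name are hypothetical, edge weights are treated as positive bandwidths, and the per-vertex variant from the slide would combine a source-to-vertex and a vertex-to-sink search.

```python
import heapq


def max_bandwidth_path(graph, source, sink):
    """Return (bottleneck, path) for the widest source-sink path.

    graph -- dict mapping node -> list of (neighbor, bandwidth) pairs,
             with positive bandwidths
    Returns (0.0, []) when the sink is unreachable.
    """
    best = {source: float("inf")}  # widest bottleneck found so far per node
    prev = {}
    heap = [(-float("inf"), source)]  # max-heap via negated bandwidth
    while heap:
        neg_bw, u = heapq.heappop(heap)
        bw = -neg_bw
        if bw < best.get(u, 0.0):
            continue  # stale heap entry
        if u == sink:
            break
        for v, w in graph.get(u, []):
            candidate = min(bw, w)  # bottleneck of the path extended to v
            if candidate > best.get(v, 0.0):
                best[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (-candidate, v))

    if sink not in best:
        return 0.0, []
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[sink], path[::-1]


# Two routes from s to t: via a (bottleneck 0.2) or via b (bottleneck 0.4).
graph = {
    "s": [("a", 0.9), ("b", 0.5)],
    "a": [("t", 0.2)],
    "b": [("t", 0.4)],
}
bottleneck, path = max_bandwidth_path(graph, "s", "t")
```

The route through `a` starts with the strongest edge but is limited by its weak second edge, so the search prefers the route through `b`.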
Expectation Maximization
Bipartite graph:
– Q_q is a candidate with frequency f_q
– R_r is a read with observed frequency o_r
– Weight h_{q,r} = probability that read r is produced by quasispecies q with j mismatches
E step:
M step:
Experimental Validation
Simulations
– Error-free reads from known HCV quasispecies
– Reads with errors generated by FlowSim (Balser et al 2010/Sept)
Real 454 reads
– HCV data
– HIV data (10 qsps)
Comparison with ShoRAH
Simulations: Error-Free Reads
44 real qsps (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006))
Simulated reads:
– 4 population sizes: 10, 20, 30, 40 sequences
– The quasispecies population follows a geometric distribution
– Number of reads is in {20K, 40K, 60K, 80K, 100K}
– The read length distribution is N(μ,400); μ is varied from 200 to 500
Results
Simulations with FlowSim 44 real quasispecies sequences (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006)) 30K reads with average length 350bp 100 bootstrapping tests on 10% - reduced data ‒ For the i-th (i = 1,.., 10) most frequent sequence assembled on the whole data, we record its reproducibility = percentage of runs when there is a match (exact or with at most k mismatches) among 10 most frequent sequences found on reduced data.
Bootstraping Tests ShoRAH outperforms ViSpA due to its read correction. If ViSpA is used on ShoRAH-corrected reads (ShoRAHreads+ViSpA), the results drastically improves ViSpA is better in assembling sequences => ViSpA is better in assembling sequences
454 Reads of HCV Qsps (Courtesy P. Balfe)
reads from a 5.2 Kb-long region of HCV-1a genomes from an intravenous drug user infected for less than 3 months
Segemehl software
– reads (average read length 292 bp)
– ~77% of reads have at least one indel
– ~7% of reads have at least one N
NJ Tree for 10 Most Frequent Qsps: ShoRAH + ViSpA
ShoRAH: 1 qsps with viable protein
ViSpA: 10 qsps with viable proteins
Top 20: 16 are viable.
Robustness
ShoRAH: repeatedly infers only the 3rd most frequent sequence, in 35% of runs.
ViSpA: repeats 7 sequences in >= 15% of runs, and the top sequence is repeated in 40% of runs.
Systematic Sequencing Errors
Stop codons in amino-acid sequences
– ShoRAH: only 1 out of top 10 corresponds to a viable protein
– ViSpA: 16 out of 20 sequences represent viable proteins
(Manual) resolution method for quasispecies sequences:
– Find the frame (MSTNP)
– Find the first stop-codon position in the qsp
– Align the amino-acid translations of the qsp and the reference
– In the alignment, go left from the stop codon until the alignment is correct
– Find the first nucleotide monomer (homopolymer run) to the left
– Try to extend or to reduce the monomer by one base and choose the variant that matches the reference
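The manual resolution steps above can be approximated in code. This is a simplified sketch, not the procedure actually used: it assumes the sequence already starts in frame, replaces the amino-acid alignment with a position-by-position comparison against the reference protein, and treats a "monomer" as a run of at least `min_len` identical bases (default 3, a hypothetical choice). All function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate in frame 0; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))


def left_homopolymer(nt, pos, min_len=3):
    """Nearest run of >= min_len identical bases ending at or before pos."""
    end = min(pos, len(nt))
    while end > 0:
        start = end - 1
        while start > 0 and nt[start - 1] == nt[end - 1]:
            start -= 1
        if end - start >= min_len:
            return start, end
        end = start
    return None


def resolve_first_stop(nt, ref_protein, min_len=3):
    """Try to remove the first stop codon by a one-base homopolymer edit."""
    protein = translate(nt)
    stop = protein.find("*")
    if stop == -1:
        return nt  # already translates without a stop codon
    # First position where the translation departs from the reference.
    mismatch = next(
        (i for i, (a, b) in enumerate(zip(protein, ref_protein)) if a != b),
        stop,
    )
    run = left_homopolymer(nt, 3 * mismatch + 3, min_len)
    if run is None:
        return nt
    start, end = run
    longer = nt[:end] + nt[start] + nt[end:]   # extend the run by one base
    shorter = nt[:start] + nt[start + 1:]      # reduce the run by one base

    def score(candidate):
        return sum(a == b for a, b in zip(translate(candidate), ref_protein))

    return max((longer, shorter), key=score)
```

As an illustration with made-up data, take a reference protein MSTKPDW and a contig in which an AAA run was over-called as AAAA: the contig translates to MSTKT*L, and reducing the run by one base restores a translation that matches the reference.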
Example Reference Contig
454 Reads of HIV Qsps (Zagordi et al. 2010)
55,611 reads from a 1.5 Kbp-long region of ten HIV-1 qsps (average read length 345 bp)
– No removal of low-quality reads
– ~99% of reads have at least one indel
– ~11.6% of reads have at least one N
ShoRAH correctly infers only 2 qsps sequences with <=4 mismatches.
ViSpA correctly infers 5 qsps with <=2 mismatches; 2 qsps are inferred exactly.
ShoRAHreads+ViSpA infers 3 qsps exactly.
Summary
Viral Spectrum Assembler (ViSpA) tool
– Simple error correction
– Qsps assembling based on maximum-bandwidth paths in weighted read graphs
– Frequency estimation via EM on all reads
– Freely available at
ShoRAH’s error correction algorithm is prone to overcorrection
ViSpA is better than ShoRAH in assembling sequences
Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions
The range of NGS applications continues to expand, fueled by advances in technology
– Improved sample prep protocols
– 3rd generation: Pacific Biosciences, Ion Torrent
Development of sophisticated analysis methods remains critical for fully realizing the potential of sequencing technologies
Further readings
Error correction
Zhao X, Palmer LE, Bolanos R, Mircean C, Fasulo D, Wittenberg GM. “EDAR: an efficient error detection and removal algorithm for next generation sequencing data,” J Comp. Biol (11):
Ilie L, Fazayeli F, Ilie S. HiTEC: accurate error correction in high-throughput sequencing data, Bioinformatics 27(3) (2011)
Read mapping
Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25.
Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hoffmann S, Otto C, and Hackermuller J. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol, 5(9):e ,
Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25(9):1105–1111.
Further readings
SNV discovery and genotyping
H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(1):1851–1858,
R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang. SNP detection for massively parallel whole-genome resequencing. Genome Research, 19:1124–1132,
I. Chepelev, G. Wei, Q. Tang, and K. Zhao. Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Research, 37(16):e106,
J. Duitama, J. Kennedy, S. Dinakar, Y. Hernandez, Y. Wu, and I.I. Mandoiu. Linkage Disequilibrium Based Genotype Calling from Low-Coverage Shotgun Sequencing Reads. BMC Bioinformatics 12(Suppl 1):S53,
J. Duitama, P.K. Srivastava, and I.I. Mandoiu. Towards Accurate Detection and Genotyping of Expressed Variants from Whole Transcriptome Sequencing Data. Proc. 1st IEEE International Conference on Computational Advances in Bio and Medical Sciences, pp ,
S.Q. Le and R. Durbin. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Research, to appear.
Further readings
Estimation of gene expression levels from RNA-Seq data
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 2008, 5(7):621–628.
Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 2009, 25(8):1026–1032.
Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 2010, 26(4):493–500.
Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 2010, 28(5):511–515.
M. Nicolae, S. Mangul, I.I. Mandoiu, and A. Zelikovsky. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology, to appear; preliminary version in Proc. WABI
Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology 2011, 12(3):R22.
Further readings
Estimation of gene expression levels from DGE data
Y. Asmann, E.W. Klee, E.A. Thompson, E. Perez, S. Middha, A. Oberg, T. Therneau, D. Smith, G. Poland, E. Wieben, and J.-P. Kocher. 3’ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics, 10(1):531,
Z.J. Wu, C.A. Meyer, S. Choudhury, M. Shipitsin, R. Maruyama, M. Bessarabova, T. Nikolskaya, S. Sukumar, A. Schwartzman, J.S. Liu, K. Polyak, and X.S. Liu. Gene expression profiling of human breast tissue samples using SAGE-Seq. Genome Research, 20(12):1730–1739,
M. Nicolae and I.I. Mandoiu. Accurate Estimation of Gene Expression Levels from DGE Sequencing Data. Proc. ISBRA 2011, to appear.
Further readings
Viral quasispecies reconstruction
Zagordi O, Klein R, Daumer M, and Beerenwinkel N. “Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies,” Nucleic Acids Research, 38(21):7400–7409,
Prosperi M, Prosperi L, Bruselles A, Abbate I, Rozera G, Vincenti D, Solmone M, Capobianchi M, and Ulivi G. “Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing,” BMC Bioinformatics, 12(1):5+, 2011
Astrovskaya I, Tork B, Mangul S, Westbrooks K, Mandoiu I, Balfe P, Zelikovsky A. “Inferring Viral Quasispecies Spectra from 454 Pyrosequencing Reads,” BMC Bioinformatics, to appear.
Zagordi O, Bhattacharya A, Eriksson N, and Beerenwinkel N. “ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data,” BMC Bioinformatics, to appear.
Balser S, Malde K, Lanzen A, Sharma A, and Jonassen I. “Characteristics of 454 pyrosequencing data–enabling realistic simulation with FlowSim,” Bioinformatics, 26:i420–5,
Software packages
Genotyping and variant detection
– LD-based genotype calling:
– SNV detection and genotyping from RNA-Seq reads:
Inference of gene expression levels
– From RNA-Seq reads:
– From DGE reads:
Reconstruction of viral quasispecies
Acknowledgments
NSF (awards , , and )
National Institute of Food and Agriculture (award )
UCONN Research Foundation (UCIG grant)
GSU Molecular Basis of Disease Fellowship
Jorge Duitama (KU Leuven)
Marius Nicolae (UConn)
Justin Kennedy (Sonalysts)
Sanjiv Dinakar (UMD)
Yozen Hernández (Hunter College)
Pramod K. Srivastava (UCHC)
Irina Astrovskaya (GSU)
Bassam Tork (GSU)
Serghei Mangul (GSU)
Kelly Westbrooks (Life Tech)
Dumitru Brinza (Life Tech)
Peter Balfe (Birmingham U.)
Pavel Skums (CDC)
Yuri Khudyakov (CDC)