Computational Advances in Next Generation Sequencing Ion Măndoiu (University of Connecticut) Alex Zelikovsky (Georgia State University) April 28, 2011,

Presentation transcript:

Computational Advances in Next Generation Sequencing Ion Măndoiu (University of Connecticut) Alex Zelikovsky (Georgia State University) April 28, 2011, Boston, MA

Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions

Next (2nd) Generation Sequencing
– Roche/454 FLX Titanium: ~1 million reads/run, bp avg. length
– Illumina HiSeq 2000: up to 6 billion PE reads/run, bp read length
– SOLiD: billion PE reads/run, 35-50bp read length

Cost of Whole Genome Sequencing C.Venter J. Watson NA18507

Outline Next-gen sequencing technologies – Illumina – SOLiD – 454 Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions

Illumina Workflow – Library Preparation Genomic DNA mRNA

Illumina Workflow – Cluster Generation

Illumina Workflow – Sequencing by Synthesis

Outline Next-gen sequencing technologies – Illumina – SOLiD – 454 Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions

SOLiD – emulsion PCR Emulsion PCR used to perform single molecule amplification of pooled library onto magnetic beads

SOLiD – sequencing by ligation

Outline Next-gen sequencing technologies – Illumina – SOLiD – 454 Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions

454 – pyrosequencing
– Emulsion PCR
– Single nucleotide addition: natural nucleotides; DNA polymerase pauses until the complementary nucleotide is dispensed; nucleotide incorporation triggers an enzymatic reaction that results in emission of light

454 sequencing errors
– Fixed number of incorporated bases vs. light intensity value
– Incorrect resolution of homopolymers => over-calls (insertions): 65-75% of errors; under-calls (deletions): 20-30% of errors

Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis Reconstruction of viral quasispecies Conclusions

A transformative technology
Re-sequencing, de novo sequencing, RNA-Seq, non-coding RNAs, structural variation, ChIP-Seq, Methyl-Seq, metagenomics, paleogenomics, viral quasispecies, … many more biological measurements “reduced” to NGS sequencing

De novo assembly pipeline (SOLiD™ de novo accessory tools 2.0)
Input: fragment reads, mate-paired reads, or paired-end reads
– resampling (optional): sample reads, estimate params
– SAET (optional): error correct reads
– preprocessor: prepare input for Velvet and ASiD
– Velvet: generate repeat graph
– ASiD: fill gaps, translate to base-space
– analysis (optional): analyze contigs/scaffolds (alignment of reads, contigs/scaffolds analysis)
Addition of pre-assembly error correction by SAET reduces the error rate from 3-5% to below 1%, which enables de novo assembly and increases N50 contig size by a factor of 3

NGS error correction
NGS reads contain more errors than standard Sanger reads
Most common approaches:
– Use of reference sequence and alignments
– Clustering (Pyronoise, SHORAH, …): define the distances between reads, cluster the reads according to this distance, calculate the consensus of each cluster. Main disadvantage: loss of rare variants
– Spectral (k-mer) approach: alignment-free (Pevzner, Brinza, EDAR 2010)

K-mer approach: single read
1. Consider the set of k-mers of the read. Example: r = ATCCGAT; k-mers for k = 4: {ATCC, TCCG, CCGA, CGAT}
2. Calculate the frequency of each k-mer in the whole data set

K-mer approach: removing errors
3. Determine the threshold between erroneous k-mers and correct ones
4. For each read, cluster k-mers by their frequencies. Each cluster determines an error region
5. Delete error regions
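The five steps can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the threshold is passed in by hand rather than estimated from the k-mer frequency distribution, and step 4's clustering is replaced by grouping consecutive low-frequency k-mers into runs. The function names (`kmers`, `kmer_counts`, `error_regions`) are hypothetical and are not taken from SAET or any published tool.

```python
from collections import Counter


def kmers(read, k):
    """Step 1: list the k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts(reads, k):
    """Step 2: frequency of every k-mer over the whole data set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def error_regions(read, counts, k, threshold):
    """Steps 3-4 (simplified): k-mers with count < threshold are treated as
    erroneous; each run of consecutive erroneous k-mers yields one error
    region, returned as a half-open (start, end) interval of base positions."""
    regions = []
    start = None
    for i, kmer in enumerate(kmers(read, k)):
        weak = counts[kmer] < threshold
        if weak and start is None:
            start = i
        elif not weak and start is not None:
            # The last weak k-mer started at i - 1 and spans k bases.
            regions.append((start, i - 1 + k))
            start = None
    if start is not None:
        regions.append((start, len(read)))
    return regions


if __name__ == "__main__":
    reads = ["ATCCGAT"] * 5 + ["ATCCTAT"]  # the last read carries one miscall
    counts = kmer_counts(reads, 4)
    print(error_regions("ATCCTAT", counts, 4, threshold=2))
```

Step 5 (deleting or correcting the reported regions) is left out, since the slides do not say how the deleted regions are handled downstream.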

SOLiD™ Accuracy Enhancement Tool (SAET)
– Combines information from multiple readings of the same DNA region and corrects miscalls in the reads
– Aligns reads de novo, without using a reference
– Applies only statistically sound corrections
– Increases accuracy by reducing raw error rate by up to 5 times, e.g., from ~5% to ~1%
– Increases throughput by increasing mappability by up to 30%, e.g., from 50% to 65%
[Figure: original vs. SAET-corrected 50bp reads from E. coli; gray marks mismatches to the reference]

De novo assembly pipeline (SOLiD™ de novo accessory tools 2.0)
Input: fragment reads, mate-paired reads, or paired-end reads
– resampling (optional): sample reads, estimate params
– SAET (optional): error correct reads
– preprocessor: prepare input for Velvet and ASiD
– Velvet: generate repeat graph
– ASiD: fill gaps, translate to base-space
– analysis (optional): analyze contigs/scaffolds (alignment of reads, contigs/scaffolds analysis)
75% of gaps are filled by the ASiD step, giving a 6-fold increase in N50 contig size

ASiD: Assembly assistant for SOLiD™ applied to bacterial de novo assembly
– Assemble scaffolds of entire genome using Velvet assembler
– Map all reads to scaffolds
– For each gap between two contigs collect hanging mates
– Error correct and assemble each subset of reads using low coverage cut-off
– Replace gap with uniquely assembled sequence
[Figure: scaffold coverage]

ASiD Assembly [Figure: the missing assembly between contig A and contig B is filled with short read fragments. Legend: green – adjacent contigs; orange – missing assembly; blue – fragments included into assembly; red – hanging fragments]

E. coli dataset
E. coli (4.6Mb), 300x coverage, 2x50, insert ~1.3Kb
Metric | Contigs | Scaffolds | Assisted Assembly Contigs
N50 | 5.2Kb | 200Kb | 30Kb
Mean contig length | 3Kb | 15Kb | 12Kb
Max contig length | 23Kb | 440Kb | 170Kb
Other reported values: Number contigs > 100 nt; Sum contig length 4.45M, 4.47M; Percent genome covered 97.24%; Average identity 99.8%

Sugarcane BACs dataset
– 24 sugarcane BACs (~130kb)
– A single BAC assembles with N50 of ~5Kbp
– Assembled contigs have high quality: zero misassemblies
– ~95% of genes lie entirely inside contigs
– Addition of 0.5-1x of 454 reads would result in a single contig per BAC
– Sequencing with a longer insert size (>600bp) would result in a single contig per BAC

Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection – Linkage disequilibrium genotype calling – SNP calling and genotyping from transcriptome sequencing data Transcriptome analysis Reconstruction of viral quasispecies Conclusions

Motivation
– Sequencing provides single-base resolution of genetic variation (SNPs, CNVs, genome rearrangements, …)
– However, determination of both alleles at variable loci is limited by coverage depth due to the random nature of shotgun sequencing
– For the Venter and Watson genomes (~7.5x average coverage), comparison with SNP genotyping chips yields only ~75% accuracy for sequencing-based calls of heterozygous SNPs [Levy et al. 07, Wheeler et al. 08]

Allele coverage for heterozygous SNPs (Watson 5.85x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 2.93x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 1.46x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 0.73x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 0.37x avg. coverage)

Prior methods
Most work devoted to de novo variation discovery
Coverage-based methods
– [Levy et al. 07] [Wheeler et al. 08] Min coverage + binomial test for calling heterozygotes
– [Wendl&Wilson 08] Arbitrary minimum allele coverage k; estimate that 21x coverage is required, based on idealized theory that “neglects any heuristic inputs”
Bayesian methods
– Maq [Li, Ruan, Durbin 08], SOAPsnp [Li et al. 09]
– Take into account some additional info such as base quality scores and heterozygosity rate priors

Do heuristic inputs help?
– In [Duitama et al., APBC 2011] we introduced methods incorporating additional sources of information for genotyping known SNPs: allele/genotype frequency and linkage disequilibrium (LD)
– Experimental results show significantly improved genotyping accuracy for low-coverage sequencing
– Similar methods published recently by 1000 Genomes analysis group

Basic notations
– Biallelic SNPs: 0 = major allele, 1 = minor allele
– SNP genotypes: 0/2 = homozygous major/minor, 1 = heterozygous
– Let r_i denote the set of mapped reads covering SNP locus i, and c_i = |r_i|
– For a read r in r_i, r(i) denotes the allele observed at locus i
– If q_{r(i)} is the phred quality score of r(i), the probability that r(i) is incorrect is given by $\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$

Incorporating base call uncertainty Probability of observing read set r i conditional on G i :
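The slide's formula for the read set probability is not preserved in this transcript. The following Python sketch shows one common way to compute it. It assumes that reads are independent, that a miscall at a biallelic SNP flips the observed allele, and that each allele of a heterozygote is sampled with probability 1/2. These assumptions and the function names are illustrative and are not confirmed by the slides.

```python
def phred_to_error(q):
    """Standard phred definition: error probability 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)


def read_set_likelihoods(observations):
    """observations: list of (allele, phred_quality) pairs, allele in {0, 1}.

    Returns [P(r_i | G_i = 0), P(r_i | G_i = 1), P(r_i | G_i = 2)], using the
    genotype coding 0 = homozygous major, 1 = heterozygous, 2 = homozygous minor.
    """
    like = [1.0, 1.0, 1.0]
    for allele, q in observations:
        eps = phred_to_error(q)
        # Homozygous major: allele 0 is a correct call, allele 1 a miscall.
        like[0] *= (1.0 - eps) if allele == 0 else eps
        # Homozygous minor: the mirror image.
        like[2] *= (1.0 - eps) if allele == 1 else eps
        # Heterozygous: 0.5 * (1 - eps) + 0.5 * eps = 0.5 for either allele.
        like[1] *= 0.5
    return like


if __name__ == "__main__":
    # One read shows the major allele, one the minor allele, both at Q20.
    print(read_set_likelihoods([(0, 20), (1, 20)]))
```

Under these assumptions the heterozygous likelihood depends only on the coverage c_i, which illustrates why heterozygotes are hard to call when c_i is small.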

Haplotype structure in human populations

HMM model of haplotype frequencies
– F_i = founder haplotype at locus i, H_i = observed allele at locus i
– For a given haplotype h, P(H=h|M) can be computed in O(nK^2) using the forward algorithm
– Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 08, Kimmel&Shamir 05, Scheet&Stephens 06]
[Figure: chain of hidden founder states F_1, F_2, …, F_n, each emitting the observed allele H_1, H_2, …, H_n]
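The O(nK^2) bound comes from the standard forward recurrence: at each of the n loci, each of the K founder states sums over K predecessors. Below is a minimal Python sketch with a hypothetical parameter layout (`init`, `trans`, `emit` as nested lists indexed from 0). It is not the authors' implementation.

```python
def haplotype_probability(h, init, trans, emit):
    """Forward algorithm for P(H = h | M).

    h:     list of observed alleles (0 or 1), one per locus
    init:  init[k] = P(F_0 = k)
    trans: trans[i][k][l] = P(F_{i+1} = l | F_i = k), for i = 0 .. n-2
    emit:  emit[i][k][a] = P(H_i = a | F_i = k)
    """
    K = len(init)
    # fwd[k] = P(h_0 .. h_i, F_i = k), initialised at locus 0.
    fwd = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, len(h)):
        fwd = [
            emit[i][l][h[i]] * sum(fwd[k] * trans[i - 1][k][l] for k in range(K))
            for l in range(K)
        ]
    return sum(fwd)


if __name__ == "__main__":
    init = [0.5, 0.5]
    trans = [[[0.9, 0.1], [0.1, 0.9]]]
    emit = [[[0.9, 0.1], [0.2, 0.8]], [[0.9, 0.1], [0.2, 0.8]]]
    print(haplotype_probability([0, 0], init, trans, emit))
```

For long haplotypes the products underflow, so a practical version would rescale `fwd` at each locus or work in log space.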

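HF-HMM for multilocus genotype inference [Figure: two haplotype HMMs, with founder states F_1 … F_n and F'_1 … F'_n emitting alleles H_1 … H_n and H'_1 … H'_n; at each locus i the pair (H_i, H'_i) determines the genotype G_i, which emits the reads R_{i,1} … R_{i,c_i}]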

Model training
– P(f_1), P(f'_1), P(f_{i+1}|f_i), P(f'_{i+1}|f'_i), P(h_i|f_i), P(h'_i|f'_i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
– P(g_i|h_i,h'_i) set to 1 if h_i+h'_i=g_i and to 0 otherwise
– Read emission probabilities are given by the conditional read set probabilities of the single SNP model

Multilocus genotyping problem
GIVEN:
– Shotgun read sets r=(r_1, r_2, …, r_n)
– Quality scores
– Trained HMM models representing LD in populations of origin for mother/father
FIND:
– Multilocus genotype g*=(g*_1, g*_2, …, g*_n) with maximum posterior probability, i.e., g* = argmax_g P(g | r)
Theorem: max_g P(g | r) cannot be approximated within unless ZPP=NP

Posterior decoding algorithm
1. For each i = 1..n, compute $g^*_i = \arg\max_{g_i} P(g_i \mid r)$
2. Return $g^* = (g^*_1, \ldots, g^*_n)$

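Forward-backward computation of posterior probabilities [Figure: HF-HMM graph highlighting, for locus i, the founder states f_i and f'_i, the alleles h_i and h'_i, the genotype g_i, and the read sets r_{1,1} … r_{1,c_1}, …, r_{i,1} … r_{i,c_i}, …, R_{n,1} … R_{n,c_n}]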


Runtime
– Direct recurrences for computing forward probabilities
– Runtime reduced to O(m + nK^3) by reusing common terms

Pipeline for LD-Based Genotype Calling [Figure: inputs are HapMap genotypes or haplotypes, the reference genome sequence, and read sequences with quality scores; reads are mapped to the reference, and the mapped reads are used to produce SNP genotype calls]

Datasets
Dataset | Test SNPs | Raw reads | Raw sequence | Mapped reads | Avg. mapped SNP cov.
Watson | K (Affy 500k) | 74.2M, 265bp avg | 19.7Gb | 49.8M (67%) | 5.85x
NA18507 Illumina | 2.85M (Hapmap) | 525M, 36bp, paired | 18.9Gb | 397M (78%) | 6.10x
NA18507 SOLiD | 2.85M (Hapmap) | 2.45G, 24-44bp, single | 75Gb | 900M (37%) | 9.85x

Mapping Procedure
454 reads mapped using the NUCMER tool of the MUMmer package [Kurtz et al 04] with default parameters
– Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
– Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded
Illumina reads mapped using MAQ [Li et al 08] with default parameters
– For reads mapped at multiple positions MAQ returns best position (breaking ties arbitrarily) together with mapping confidence
– We filtered bad alignments and discarded paired-end reads that are not mapped in pairs using the “submap -p” command
SOLiD reads mapped using BioScope
– Alignments provided by the authors of [McKernan et al. 99]

Comparison of genotyping methods (NA18507 Illumina, Homozygotes)

Comparison of genotyping methods (NA18507 Illumina, Heterozygous)

HMM posterior accuracy on the 3 datasets

Distribution of allele coverage ratios for heterozygous SNPs

HMM posterior accuracy at varying call rates (Watson 454 reads, 5.85x avg.)

Summary
– The posterior decoding algorithm has highly scalable running time and yields significant improvements in genotype calling accuracy compared to previous methods
– Improved accuracy makes low-coverage sequencing competitive in cost with microarrays for next-generation GWAS
– Open source code available at

Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection – Linkage disequilibrium genotype calling – SNP calling and genotyping from transcriptome sequencing data Transcriptome analysis Reconstruction of viral quasispecies Conclusions

Motivation RNA-Seq is the method of choice for studying functional effects of genetic variability – Mature library preparation & sequencing protocols – Much less expensive than genome sequencing Can sequence variants be discovered reliably from RNA-Seq data? – SNVQ: novel Bayesian model for SNV discovery and genotyping from RNA-Seq data [Duitama et al., ICCABS 2011 ] – Particularly appropriate when interest is in expressed mutations (cancer immunotherapy)

SNP Calling from Genomic DNA Reads [Figure: read sequences and quality scores are aligned to the reference genome sequence (read mapping), and the aligned bases at each locus are then used for SNP calling]

Mapping mRNA Reads

Spliced read alignment
C. Trapnell, L. Pachter, and S.L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–1111

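Mapping and Merging Strategy [Figure: mRNA reads are mapped separately against CCDS (CCDS mapping) and against the genome (genome mapping); the CCDS mapped reads and genome mapped reads are then combined by read merging into the final set of mapped reads]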

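Read Merging
Genome | CCDS | Agree? | Hard Merge | Soft Merge
Unique | Unique | Yes | Keep | Keep
Unique | Unique | No | Throw | Throw
Unique | Multiple | No | Throw | Keep
Unique | Not mapped | No | Keep | Keep
Multiple | Unique | No | Throw | Keep
Multiple | Multiple | No | Throw | Throw
Multiple | Not mapped | No | Throw | Throw
Not mapped | Unique | No | Keep | Keep
Not mapped | Multiple | No | Throw | Throw
Not mapped | Not mapped | Yes | Throw | Throw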
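The merging table can be expressed as a small decision function. The Python sketch below encodes one reading of the flattened table, in which a single Keep/Throw entry applies to both the hard and the soft merge. The status strings and the function name are hypothetical.

```python
def merge_decision(genome, ccds, agree=False, mode="hard"):
    """Decide whether a read is kept after merging genome and CCDS alignments.

    genome, ccds: 'unique', 'multiple', or 'unmapped'
    agree:        True if the two unique alignments point to the same locus
                  (only consulted when both statuses are 'unique')
    mode:         'hard' or 'soft'
    Returns 'keep' or 'throw'.
    """
    statuses = {genome, ccds}
    if genome == "unique" and ccds == "unique":
        # Two unique alignments are trusted only if they coincide.
        return "keep" if agree else "throw"
    if statuses == {"unique", "unmapped"}:
        # A single unique alignment with no competing evidence is kept.
        return "keep"
    if statuses == {"unique", "multiple"}:
        # Only the soft merge lets the unique alignment win over the ambiguous one.
        return "keep" if mode == "soft" else "throw"
    # Multiple/multiple, multiple/unmapped, unmapped/unmapped.
    return "throw"


if __name__ == "__main__":
    for mode in ("hard", "soft"):
        print(mode, merge_decision("unique", "multiple", mode=mode))
```

The hard and soft merges differ only for reads that are unique in one mapping and multiply mapped in the other.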

SNV Detection and Genotyping [Figure: reads aligned to the reference sequence, with locus i and the read set R_i marked]
– r(i): base call of read r at locus i
– ε_{r(i)}: probability of error reading base call r(i)
– G_i: genotype at locus i

SNV Detection and Genotyping Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one
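As a minimal illustration of this step, the Python sketch below applies Bayes' rule to per-genotype likelihoods and priors supplied by the caller. The genotype labels, prior values, and function names are assumptions made for the example, and the code does not reproduce the SNVQ likelihood model itself.

```python
def genotype_posteriors(likelihoods, priors):
    """Bayes' rule: P(G | R) = P(R | G) P(G) / sum over G' of P(R | G') P(G').

    likelihoods, priors: dicts keyed by genotype label.
    """
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("all genotypes have zero joint probability")
    return {g: p / total for g, p in joint.items()}


def call_genotype(likelihoods, priors):
    """Return (genotype with the largest posterior, that posterior)."""
    post = genotype_posteriors(likelihoods, priors)
    best = max(post, key=post.get)
    return best, post[best]


if __name__ == "__main__":
    likelihoods = {"AA": 0.0099, "AC": 0.25, "CC": 0.0099}
    # Hypothetical priors that strongly favour the reference homozygote.
    priors = {"AA": 0.998, "AC": 0.001, "CC": 0.001}
    print(call_genotype(likelihoods, priors))
```

With only two reads, the prior outweighs a likelihood that favours the heterozygote. This is the low-coverage effect that motivates adding other sources of information.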

Current Models
Maq:
– Keep just the alleles with the two largest counts
– Pr(R_i | G_i=H_iH_i) is the probability of observing k alleles r(i) different from H_i
– Pr(R_i | G_i=H_iH'_i) is approximated as a binomial with p=0.5
SOAPsnp:
– Pr(r_i | G_i=H_iH'_i) is the average of Pr(r_i|H_i) and Pr(r_i|G_i=H'_i)
– A rank test on the quality scores of the allele calls is used to confirm heterozygosity

SNVQ Model Calculate conditional probabilities by multiplying contributions of individual reads

Experimental Setup 113 million 32bp Illumina mRNA reads generated from blood cell tissue of Hapmap individual NA12878 (NCBI SRA database accession numbers SRX and SRX000566) – We tested genotype calling using as gold standard 3.4 million SNPs with known genotypes for NA12878 available in the database of the Hapmap project – True positive: called variant for which Hapmap genotype coincides – False positive: called variant for which Hapmap genotype does not coincide

Comparison of Mapping Strategies

Comparison of Variant Calling Strategies

Data Filtering

– Allow just x reads per start locus to eliminate PCR amplification artifacts
– [Chepelev et al. 2010] algorithm: for each locus, group the reads starting there into those with 0, 1, and 2 mismatches, then choose at random one read from each group
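The second filter can be sketched in a few lines of Python. The sketch assumes each read is summarised as a (start locus, mismatch count, read id) tuple. A real pipeline would likely also key on chromosome and strand, which the slide does not specify. The function name is hypothetical.

```python
import random
from collections import defaultdict


def filter_pcr_duplicates(reads, rng=None):
    """Keep one randomly chosen read per (start locus, mismatch count) group.

    reads: iterable of (start_locus, mismatches, read_id) tuples
    rng:   optional random.Random instance, for reproducible sampling
    Returns the sorted list of retained read ids.
    """
    rng = rng or random.Random()
    groups = defaultdict(list)
    for start, mismatches, read_id in reads:
        groups[(start, mismatches)].append(read_id)
    # One survivor per group removes stacks of identical PCR copies.
    return sorted(rng.choice(ids) for ids in groups.values())


if __name__ == "__main__":
    reads = [(100, 0, "a"), (100, 0, "b"), (100, 1, "c"), (200, 0, "d")]
    print(filter_pcr_duplicates(reads, random.Random(0)))
```

Grouping by mismatch count keeps a read that differs from the reference even when a perfectly matching read starts at the same locus, so evidence for a variant is not lost during de-duplication.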

Comparison of Data Filtering Strategies

Accuracy per RPKM bins

Summary Simple strategy to map mRNA reads using both the reference genome and the CCDS database and new Bayesian model for SNV detection and genotyping Experiments on publicly available datasets show that SNVQ outperforms widely used SNV detection methods

Outline Next-gen sequencing technologies Overview of NGS applications Genotyping and variant detection Transcriptome analysis – Estimation of expression levels from RNA-Seq reads – Estimation of expression levels from DGE reads Reconstruction of viral quasispecies Conclusions

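RNA-Seq protocol [Figure: make cDNA and shatter into fragments; sequence fragment ends; map reads to a gene with exons A, B, C, D, E. Downstream analyses: Gene Expression (GE), Isoform Expression (IE), and Isoform Discovery (ID), illustrated with transcripts ABC, AC, and DE]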

Alternative splicing [Griffith and Marra 07]

Challenges to accurate estimation of gene expression levels
– Read ambiguity (multireads)
– What is the gene length?

Previous approaches to GE
– Ignore multireads
– [Mortazavi et al. 08]: fractionally allocate multireads based on unique read estimates
– [Pasaniuc et al. 10]: EM algorithm for solving ambiguities
– Gene length: sum of lengths of exons that appear in at least one isoform => underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Read Ambiguity in IE ABCDE AC

Previous approaches to IE
– [Jiang&Wong 09]: Poisson model + importance sampling, single reads
– [Richard et al. 10]: EM Algorithm based on Poisson model, single reads in exons
– [Li et al. 10]: EM Algorithm, single reads
– [Feng et al. 10]: Convex quadratic program, pairs used only for ID
– [Trapnell et al. 10]: Extends Jiang’s model to paired reads; fragment length distribution

IsoEM algorithm [Nicolae et al., WABI 2010]
Unified probabilistic model and Expectation-Maximization Algorithm (IsoEM) for IE considering
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
– Repeat and hexamer bias correction

Read-isoform compatibility

Fragment length distribution: paired reads [Figure: a read pair aligned to isoforms ABC and AC, annotated with F_a(i) and F_a(j) for isoforms i and j]

Fragment length distribution: single reads [Figure: a single read aligned to isoforms ABC and AC, annotated with F_a(i) and F_a(j) for isoforms i and j]

IsoEM pseudocode: E-step, M-step
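The pseudocode itself is not preserved in this transcript. As a rough illustration of an EM loop of this kind, the Python sketch below estimates isoform frequencies from read-isoform compatibility weights. It assumes the weights already summarise base qualities, fragment lengths, and strand information, it starts from uniform frequencies, and it normalises by an effective isoform length supplied by the caller. All names and the data layout are hypothetical and do not come from the IsoEM code.

```python
def em_isoform_frequencies(read_weights, eff_lengths, iterations=200, tol=1e-9):
    """Estimate isoform frequencies by expectation-maximization.

    read_weights: list of dicts {isoform: weight > 0}, one dict per read,
                  listing only the isoforms compatible with that read
    eff_lengths:  dict {isoform: effective length}
    Returns a dict {isoform: frequency}; the frequencies sum to 1.
    """
    isoforms = list(eff_lengths)
    f = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(iterations):
        # E-step: split each read among compatible isoforms in proportion
        # to weight * current frequency, accumulating expected counts n(i).
        n = {i: 0.0 for i in isoforms}
        for weights in read_weights:
            total = sum(w * f[i] for i, w in weights.items())
            if total == 0.0:
                continue
            for i, w in weights.items():
                n[i] += w * f[i] / total
        # M-step: frequency proportional to expected count per unit length.
        norm = sum(n[i] / eff_lengths[i] for i in isoforms)
        if norm == 0.0:
            raise ValueError("no read is compatible with any isoform")
        new_f = {i: n[i] / eff_lengths[i] / norm for i in isoforms}
        delta = max(abs(new_f[i] - f[i]) for i in isoforms)
        f = new_f
        if delta < tol:
            break
    return f


if __name__ == "__main__":
    reads = [{"A": 1.0}] * 30 + [{"B": 1.0}] * 10 + [{"A": 1.0, "B": 1.0}] * 20
    print(em_isoform_frequencies(reads, {"A": 100.0, "B": 100.0}))
```

In this toy input the 20 ambiguous reads end up split 3:1 in favour of isoform A, matching the ratio of the unique reads.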

Implementation details
– Collapse identical reads into read classes
[Figure: isoforms i1 … i6; reads compatible with (i1,i2), (i3,i4), (i3,i5), (i3,i4) are collapsed into read classes, e.g. LCA(i3,i4)]

Implementation details
– Run EM on connected components, in parallel
[Figure: compatibility graph over isoforms i1 … i6]

Simulation setup
– Human genome UCSC known isoforms
– GNFAtlas2 gene expression levels; uniform/geometric expression of gene isoforms
– Normally distributed fragment lengths: mean 250, std. dev. 25

Accuracy measures
– Error Fraction (EF_t): percentage of isoforms (or genes) with relative error larger than given threshold t
– Median Percent Error (MPE): threshold t for which EF is 50%
– r^2
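The first two measures are straightforward to compute from paired estimated and true expression values. The Python sketch below assumes relative error is |estimate − truth| / truth and skips entries whose true value is zero. Both choices are assumptions, since the slide does not say how zero-expression isoforms are handled.

```python
import statistics


def relative_errors(estimated, true):
    """Relative error per isoform, skipping isoforms with true value 0."""
    return [abs(e - t) / t for e, t in zip(estimated, true) if t > 0]


def error_fraction(estimated, true, t):
    """EF_t: percentage of isoforms whose relative error exceeds threshold t."""
    errs = relative_errors(estimated, true)
    return 100.0 * sum(e > t for e in errs) / len(errs)


def median_percent_error(estimated, true):
    """MPE: the threshold at which EF is 50%, i.e. the median relative error,
    expressed as a percentage."""
    return 100.0 * statistics.median(relative_errors(estimated, true))


if __name__ == "__main__":
    est = [110.0, 90.0, 150.0, 100.0]
    tru = [100.0, 100.0, 100.0, 100.0]
    print(error_fraction(est, tru, 0.15), median_percent_error(est, tru))
```

Plotting `error_fraction` against a range of thresholds t gives an error fraction curve.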

Error fraction curves - isoforms 30M single reads of length 25 (simulated)

Error fraction curves - genes 30M single reads of length 25 (simulated)

MPE and EF_15 by gene expression level 30M single reads of length 25

MAQC data
RNA samples: UHRR, HBRR
6 libraries, 47-92M 35bp reads each [Bullard et al. 10]
Bases called using both auto and phiX calibration for 2 libraries
qPCR: quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]

r² comparison for MAQC samples

MPE comparison for MAQC samples

Read length effect on IE MPE
Fixed sequencing throughput (750Mb)
[Figure panels: Single Reads, Paired Reads]

Read length effect on IE r²
Fixed sequencing throughput (750Mb)

Effect of pairs & strand information 75bp reads

Runtime scalability
Scalability experiments conducted on a Dell PowerEdge R900
– Four 6-core Xeon E7450 processors at 2.4 GHz, 128 GB of internal memory

Summary
Efficient EM algorithm for estimating isoform/gene expression levels
– Integrates fragment length distribution, base qualities, pair and strand info
– Java implementation available at

Outline
Next-gen sequencing technologies
Overview of NGS applications
Genotyping and variant detection
Transcriptome analysis
– Estimation of expression levels from RNA-Seq reads
– Estimation of expression levels from DGE reads
Reconstruction of viral quasispecies
Conclusions

DGE/SAGE-Seq protocol
[Figure: protocol steps applied to an mRNA with poly-A tail (AAAAA)]
1. Cleave with anchoring enzyme (AE) (CATG)
2. Attach primer for tagging enzyme (TE) (TCCRAC)
3. Cleave with tagging enzyme
4. Map tags
5. Gene Expression (GE)

Inference algorithms for DGE data
Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10]
Heuristic rescue of some ambiguous tags [Wu et al. 10]
DGE-EM algorithm [Nicolae & Mandoiu, ISBRA 2011]
– Uses all tags, including all ambiguous ones
– Uses quality scores
– Takes into account partial digest and gene isoforms

Tag formation probability

Tag-isoform compatibility

DGE-EM algorithm
assign random values to all f(i)
while not converged
  E-step
    init all n(i,j) to 0
    for each tag t
      for (i,j,w) in t
  M-step
    for each isoform i
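
The update formulas inside the loops are missing from this transcript, so the following sketch fills them with the standard proportional allocation used in EM for mixtures. That choice is an assumption, not the published DGE-EM update. In particular the M-step here simply normalizes the expected tag counts per isoform and ignores any partial-digest correction. Each tag is a list of `(i, j, w)` triples (isoform, site index, weight) as on the slide. The function name `dge_em` is hypothetical, and the sketch starts from uniform rather than random values so runs are reproducible.

```python
from collections import defaultdict


def dge_em(tags, isoforms, max_iter=1000, tol=1e-10):
    """tags: list of tags, each a list of (isoform, site, weight) triples."""
    f = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(max_iter):
        # E-step: distribute each tag over its compatible (isoform, site) pairs.
        n = defaultdict(float)
        for t in tags:
            total = sum(w * f[i] for i, j, w in t)
            if total == 0.0:
                continue
            for i, j, w in t:
                n[(i, j)] += w * f[i] / total
        # M-step (simplified assumption): normalize per-isoform tag counts.
        counts = {i: 0.0 for i in isoforms}
        for (i, j), value in n.items():
            counts[i] += value
        s = sum(counts.values())
        new_f = {i: counts[i] / s for i in isoforms}
        change = max(abs(new_f[i] - f[i]) for i in isoforms)
        f = new_f
        if change < tol:
            break
    return f
```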

MAQC data
DGE
– 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09]
– Anchoring enzyme DpnII (GATC)
RNA-Seq
– 6 libraries, 47-92M 35bp reads each [Bullard et al. 10]
qPCR
– Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]

DGE-EM vs. Uniq on HBRR Library 4

DGE vs. RNA-Seq

Synthetic data
1-30M tags, lengths 14-26bp
UCSC hg19 genome and known isoforms
Simulated expression levels
– Gene expression for 5 tissues from the GNFAtlas2
– Geometric expression for the isoforms of each gene
Anchoring enzymes from REBASE
– DpnII (GATC) [Asmann et al. 09]
– NlaIII (CATG) [Wu et al. 10]
– CviJI (RGCY, R=G or A, Y=C or T)

MPE for 30M 21bp tags RNA-Seq: 8.3 MPE

Anchoring enzyme statistics

Summary
New DGE-EM algorithm
– Improves accuracy over previous methods by using ambiguous tags and considering isoforms and partial digestion
– Source code freely available at
RNA-Seq and DGE based estimates have comparable cost-normalized accuracy on MAQC data
– When using best inference algorithm for each type of data
Simulations suggest possible DGE protocol improvements
– Enzymes with degenerate recognition sites (e.g. CviJI)
– Optimizing cutting probability

Outline
Next-gen sequencing technologies
Overview of NGS applications
Genotyping and variant detection
Transcriptome analysis
Reconstruction of viral quasispecies
– Quasispecies assembly problem
– ViSpA: Viral Spectrum Assembly Tool
Conclusions

Viral Quasispecies
RNA viruses (HIV, HCV)
– Many replication mistakes
– Quasispecies (qsps) = co-existing closely related variants
Variants differ in
– virulence
– ability to escape the immune system
– resistance to antiviral therapies
– tissue tropism
How do qsps contribute to viral persistence and evolution?
We need software that assembles reads into multiple genomes!

Quasispecies Spectrum Reconstruction (QSR) Problem
Given
– pyrosequencing reads from a quasispecies population of unknown size and distribution
Reconstruct the quasispecies spectrum
– sequences
– frequencies

State-of-the-Art Tools
ShoRAH (O. Zagordi et al):
– probabilistic clustering
– #clusters: Dirichlet process mixture
Amplicon-based (Prosperi et al 2011)
– determined amplicon partition
– measure of population diversity
ViSpA (Astrovskaya et al 2011)
– Max bandwidth path in weighted graphs
– Accounting for typing error & mutation rate

Viral Spectrum Assembler (ViSpA) Flow

Alignment
Reference sequence is available
Multiple coverage
No repeats => unique alignment
Alignment score
– Minimizes Hamming distances
– Penalizes indels more than mismatches
[Figure: aligned reads with deletions and insertions marked]

Preprocessing of Aligned Reads
1. Deletions in reads: D
2. Insertions into reference: I
3. Error correction
– Replace deletions, confirmed by a single read, with either the allele value that is present in all other reads or N
– Remove insertions, confirmed by a single read

Read Error Correction
Distinguish rare mutations from genotyping errors
ViSpA = Replace unique outliers
ShoRAH (Zagordi et al, 2010)
– Probabilistic clustering
– 3 overlapping windows
EDAR (Zhao et al, 2010)
– Count of k-mers
– Works for qsps

Read Graph: Vertices
Subread = completely contained in some read with ≤ n mismatches.
Superread = not a subread => the vertex in the read graph.
[Figure: reads ACTGGTCCCTCCTGAGTGT, GGTCCCTCCT (subread), TGGTCACTCGTGAG (subread with n mismatches), ACCTCATCGAAGCGGCGTCCT (superread)]
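
The subread and superread definitions can be sketched as a containment test on aligned reads. This is an illustration, not the ViSpA code. It assumes each read is a `(start, sequence)` pair in reference coordinates with indels already resolved, and it breaks ties between mutually contained reads by keeping the one that sorts first (longer, then leftmost); both are assumptions of this sketch.

```python
def is_subread(read, other, n):
    """True if `read` lies inside `other` with at most n mismatches."""
    (s1, q1), (s2, q2) = read, other
    if s1 < s2 or s1 + len(q1) > s2 + len(q2):
        return False
    off = s1 - s2
    mismatches = sum(a != b for a, b in zip(q1, q2[off:off + len(q1)]))
    return mismatches <= n


def superreads(reads, n):
    """Reads that are not subreads of any read earlier in the ordering."""
    ordered = sorted(set(reads), key=lambda r: (-len(r[1]), r[0], r[1]))
    kept = []
    for idx, read in enumerate(ordered):
        if not any(is_subread(read, other, n) for other in ordered[:idx]):
            kept.append(read)
    return kept
```

Using the sequences from the slide with assumed start positions 0, 3, 2 and 10, the read TGGTCACTCGTGAG differs from the first read in two positions of its span, so it is dropped as a subread for n = 2 but kept as a superread for n = 1.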

Read Graph: Edges
Edge between two vertices if there is an overlap between superreads and they agree on their overlap with ≤ m mismatches
Several paths may represent the same sequence.
– Transitive reduction
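
The edge rule can be sketched as follows, assuming superreads are `(start, sequence)` pairs in reference coordinates. An edge goes from u to v when v starts inside u and extends past its end, and the overlapping parts differ in at most m positions. The function name `overlap_edges` is hypothetical, and the transitive reduction step is left out of this sketch.

```python
def overlap_edges(superreads, m):
    """Return (u, v, overhang, mismatches) for every admissible overlap."""
    reads = sorted(superreads)
    edges = []
    for u in reads:
        for v in reads:
            (su, qu), (sv, qv) = u, v
            eu = su + len(qu)
            # v must start strictly inside u and end strictly after u.
            if su < sv < eu and eu < sv + len(qv):
                overlap = eu - sv
                mism = sum(a != b for a, b in zip(qu[sv - su:], qv[:overlap]))
                if mism <= m:
                    edges.append((u, v, sv - su, mism))
    return edges
```

The third tuple element is the shift in start positions of the two superreads, which is the overhang used by the edge cost.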

Edge Cost
Cost measures the uncertainty that two superreads belong to the same quasispecies.
Overhang Δ is the shift in start positions of two overlapping superreads.
The cost depends on Δ, on the number j of mismatches in the overlap o, and on the 454 error rate ε.
Choose the most probable source-sink path through each vertex.

Path to Sequence
The s-t-Max Bandwidth Path per vertex (maximizing probability)
1. Build coarse sequence out of path’s superreads:
– For each position: >70%-majority if it exists, otherwise N
2. Replace coarse sequence with weighted consensus obtained on all reads
3. Select unique sequences out of constructed sequences.
Repeated sequences = evidence of real qsps sequence
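
Step 1 above, the coarse sequence built by a >70% majority vote per position, can be sketched like this. The superreads on a path are assumed to be `(start, sequence)` pairs in reference coordinates; the function name `coarse_consensus` is hypothetical, and the weighted consensus of step 2 is not shown.

```python
from collections import Counter


def coarse_consensus(path_superreads, threshold=0.7):
    """Per-position majority base over a path's superreads, N if no majority."""
    start = min(s for s, _ in path_superreads)
    end = max(s + len(q) for s, q in path_superreads)
    columns = [Counter() for _ in range(end - start)]
    for s, q in path_superreads:
        for k, base in enumerate(q):
            columns[s - start + k][base] += 1
    out = []
    for col in columns:
        if not col:
            out.append("N")  # position not covered by any superread
            continue
        base, count = col.most_common(1)[0]
        out.append(base if count / sum(col.values()) > threshold else "N")
    return "".join(out)
```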

Expectation Maximization
Bipartite graph:
– Q: each q is a candidate with frequency f_q
– R: each r is a read with observed frequency o_r
– Weight h_{q,r} = probability that read r is produced by quasispecies q with j mismatches
E step:
M step:
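
The E-step and M-step formulas are not preserved in this transcript. The sketch below uses the usual mixture-model updates over the bipartite graph as an assumption: each read is split among candidates in proportion to f_q times h_{q,r}, and the new f_q is the o_r-weighted share it receives. The function name `qsps_em` and the nested-dictionary layout of `h` are hypothetical.

```python
def qsps_em(h, o, max_iter=1000, tol=1e-10):
    """h: {q: {r: h_qr}} read-production probabilities (missing = 0).
    o: {r: observed frequency of read r}.
    Returns estimated candidate frequencies {q: f_q}."""
    candidates = list(h)
    f = {q: 1.0 / len(candidates) for q in candidates}
    for _ in range(max_iter):
        new_f = {q: 0.0 for q in candidates}
        for r, o_r in o.items():
            # E-step: posterior share of read r for each candidate q.
            total = sum(f[q] * h[q].get(r, 0.0) for q in candidates)
            if total == 0.0:
                continue
            for q in candidates:
                new_f[q] += o_r * f[q] * h[q].get(r, 0.0) / total
        # M-step: renormalize the accumulated shares into frequencies.
        s = sum(new_f.values())
        new_f = {q: value / s for q, value in new_f.items()}
        change = max(abs(new_f[q] - f[q]) for q in candidates)
        f = new_f
        if change < tol:
            break
    return f
```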

Experimental Validation
Simulations
– Error-free reads from known HCV quasispecies
– Reads with errors generated by FlowSim (Balser et al 2010/Sept)
Real 454 reads
– HCV data
– HIV data (10 qsps)
Comparison with ShoRAH

Simulations: Error-Free Reads
44 real qsps (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006))
Simulated reads:
– 4 population sizes: 10, 20, 30, 40 sequences
– Geometric distribution of the quasispecies population
– Number of reads: {20K, 40K, 60K, 80K, 100K}
– Read length distribution N(μ,400); μ is varied from 200 to 500

Results

Simulations with FlowSim
44 real quasispecies sequences (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006))
30K reads with average length 350bp
100 bootstrapping tests on 10%-reduced data
– For the i-th (i = 1,.., 10) most frequent sequence assembled on the whole data, we record its reproducibility = percentage of runs when there is a match (exact or with at most k mismatches) among 10 most frequent sequences found on reduced data.
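
The reproducibility measure can be sketched as a simple count over bootstrap runs. This illustration assumes assembled sequences of equal length are compared position by position, which is a simplification (no alignment is performed); the function name `reproducibility` is hypothetical.

```python
def reproducibility(full_top, bootstrap_runs, k=0):
    """full_top: most frequent sequences assembled on the whole data.
    bootstrap_runs: one list of top sequences per run on reduced data.
    Returns, per sequence, the percentage of runs containing a match
    with at most k mismatches."""
    def matches(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= k

    result = []
    for seq in full_top:
        hits = sum(any(matches(seq, cand) for cand in run)
                   for run in bootstrap_runs)
        result.append(100.0 * hits / len(bootstrap_runs))
    return result
```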

Bootstrapping Tests
ShoRAH outperforms ViSpA due to its read correction.
If ViSpA is used on ShoRAH-corrected reads (ShoRAHreads+ViSpA), the results drastically improve
=> ViSpA is better in assembling sequences

454 Reads of HCV Qsps (Courtesy P. Balfe)
reads from 5.2Kb-long region of HCV-1a genomes from an intravenous drug user infected for less than 3 months
Segemehl software
– reads (average read length 292bp)
– ~77% of reads have at least one indel
– ~7% of reads with at least one N

NJ Tree for 10 Most Frequent Qsps: ShoRAH + ViSpA
ShoRAH: 1 qsps with viable protein
ViSpA: 10 qsps with viable proteins
Top 20: 16 are viable.

Robustness
ShoRAH: 35% of the time infers only the 3rd most frequent sequence.
ViSpA repeats 7 sequences >= 15% of the time, and the top sequence is repeated 40% of the time.

Systematic Sequencing Errors
Stop codons in amino-acid sequences
– ShoRAH: only 1 out of top 10 corresponds to viable protein
– ViSpA: 16 out of 20 sequences represent viable proteins
(Manual) Resolution method for quasispecies sequences:
– Find the frame (MSTNP )
– Find the first stop-codon position in qsp
– Align the amino-acid translations of qsp and the reference
– In the alignment go left from the stop-codon until the correct alignment
– Find first nucleotide monomer to the left
– Try to extend or to reduce the monomer by one base and choose the one which matches the reference
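
The second step of the manual resolution method, locating the first stop codon in a given reading frame, is easy to automate. The sketch below uses the three standard stop codons on the DNA alphabet; the function name `first_stop_codon` is hypothetical and the remaining steps (alignment and monomer adjustment) are not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon(seq, frame=0):
    """0-based nucleotide position of the first in-frame stop codon,
    or None if the frame is open to the end of the sequence."""
    for pos in range(frame, len(seq) - 2, 3):
        if seq[pos:pos + 3].upper() in STOP_CODONS:
            return pos
    return None
```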

Example Reference Contig

454 Reads of HIV Qsps (Zagordi et al. 2010)
55,611 reads from ten 1.5Kbp-long regions of HIV-1 (average read length 345bp)
– No removal of low-quality reads
– ~99% of reads have at least one indel
– ~11.6% of reads with at least one N
ShoRAH correctly infers only 2 qsps sequences with <=4 mismatches.
ViSpA correctly infers 5 qsps with <=2 mismatches, 2 qsps are inferred exactly.
ShoRAHreads+ViSpA infers 3 qsps exactly.

Summary
Viral Spectrum Assembler (ViSpA) tool
– Simple error correction
– Qsps assembling based on maximum-bandwidth paths in weighted read graphs
– Frequency estimation via EM on all reads
– Freely available at
ShoRAH’s error correction algorithm is prone to overcorrection
ViSpA is better than ShoRAH in assembling sequences

Outline
Next-gen sequencing technologies
Overview of NGS applications
Genotyping and variant detection
Transcriptome analysis
Reconstruction of viral quasispecies
Conclusions

The range of NGS applications continues to expand, fueled by advances in technology
– Improved sample prep protocols
– 3rd generation: Pacific Biosciences, Ion Torrent
Development of sophisticated analysis methods remains critical for fully realizing the potential of sequencing technologies

Further readings
Error correction
Zhao X, Palmer LE, Bolanos R, Mircean C, Fasulo D, Wittenberg GM. “EDAR: an efficient error detection and removal algorithm for next generation sequencing data,” J Comp. Biol (11):
Ilie L, Fazayeli F, Ilie S. HiTEC: accurate error correction in high-throughput sequencing data, Bioinformatics 27(3) (2011)
Read mapping
Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25.
Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hoffmann S, Otto C, and Hackermuller J. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol, 5(9):e ,
Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25(9):1105–1111.

Further readings
SNV discovery and genotyping
H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(1):1851–1858,
R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang. SNP detection for massively parallel whole-genome resequencing. Genome Research, 19:1124–1132,
I. Chepelev, G. Wei, Q. Tang, and K. Zhao. Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Research, 37(16):e106,
J. Duitama, J. Kennedy, S. Dinakar, Y. Hernandez, Y. Wu, and I.I. Mandoiu. Linkage Disequilibrium Based Genotype Calling from Low-Coverage Shotgun Sequencing Reads. BMC Bioinformatics 12(Suppl 1):S53,
J. Duitama, P.K. Srivastava, and I.I. Mandoiu. Towards Accurate Detection and Genotyping of Expressed Variants from Whole Transcriptome Sequencing Data. Proc. 1st IEEE International Conference on Computational Advances in Bio and Medical Sciences, pp ,
S.Q. Le and R. Durbin. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Research, to appear.

Further readings
Estimation of gene expression levels from RNA-Seq data
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 2008, 5(7):621–628.
Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 2009, 25(8):1026–1032.
Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 2010, 26(4):493–500.
Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 2010, 28(5):511–515.
M. Nicolae, S. Mangul, I.I. Mandoiu, and A. Zelikovsky. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology, to appear, preliminary version in Proc. WABI
Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology 2011, 12(3):R22.

Further readings
Estimation of gene expression levels from DGE data
Y. Asmann, E.W. Klee, E.A. Thompson, E. Perez, S. Middha, A. Oberg, T. Therneau, D. Smith, G. Poland, E. Wieben, and J.-P. Kocher. 3’ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics, 10(1):531,
Z.J. Wu, C.A. Meyer, S. Choudhury, M. Shipitsin, R. Maruyama, M. Bessarabova, T. Nikolskaya, S. Sukumar, A. Schwartzman, J.S. Liu, K. Polyak, and X.S. Liu. Gene expression profiling of human breast tissue samples using SAGE-Seq. Genome Research, 20(12):1730–1739,
M. Nicolae and I.I. Mandoiu. Accurate Estimation of Gene Expression Levels from DGE Sequencing Data. Proc. ISBRA 2011, to appear.

Further readings
Viral quasispecies reconstruction
Zagordi O, Klein R, Daumer M, and Beerenwinkel N. “Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies,” Nucleic Acids Research, 38(21):7400–7409,
Prosperi M, Prosperi L, Bruselles A, Abbate I, Rozera G, Vincenti D, Solmone M, Capobianchi M, and Ulivi G. “Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing,” BMC Bioinformatics, 12(1):5+, 2011
Astrovskaya I, Tork B, Mangul S, Westbrooks K, Mandoiu I, Balfe P, Zelikovsky A. “Inferring Viral Quasispecies Spectra from 454 Pyrosequencing Reads,” BMC Bioinformatics, to appear.
Zagordi O, Bhattacharya A, Eriksson N, and Beerenwinkel N. “ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data,” BMC Bioinformatics, to appear.
Balser S, Malde K, Lanzen A, Sharma A, and Jonassen I. “Characteristics of 454 pyrosequencing data–enabling realistic simulation with FlowSim,” Bioinformatics, 26:i420–5,

Software packages
Genotyping and variant detection
– LD-based genotype calling:
– SNV detection and genotyping from RNA-Seq reads:
Inference of gene expression levels
– From RNA-Seq reads:
– From DGE reads:
Reconstruction of viral quasispecies
–

Acknowledgments
NSF (awards , , and )
National Institute of Food and Agriculture (award )
UCONN Research Foundation (UCIG grant)
GSU Molecular Basis of Disease Fellowship
Jorge Duitama (KU Leuven), Marius Nicolae (UConn), Justin Kennedy (Sonalysts), Sanjiv Dinakar (UMD), Yozen Hernández (Hunter College), Pramod K. Srivastava (UCHC)
Irina Astrovskaya (GSU), Bassam Tork (GSU), Serghei Mangul (GSU), Kelly Westbrooks (Life Tech), Dumitru Brinza (Life Tech), Peter Balfe (Birmingham U.), Pavel Skums (CDC), Yuri Khudyakov (CDC)