Bioinformatics Pipelines for RNA-Seq Data Analysis

Slides:



Advertisements
Similar presentations
Marius Nicolae Computer Science and Engineering Department
Advertisements

RNA-Seq based discovery and reconstruction of unannotated transcripts
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu Viral.
 Experimental Setup  Whole brain RNA-Seq Data from Sanger Institute Mouse Genomes Project [Keane et al. 2011]  Synthetic hybrids with different levels.
RNAseq.
Transcriptome Sequencing with Reference
Next-generation sequencing
Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)
Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work.
Data Analysis for High-Throughput Sequencing
RNA-Seq An alternative to microarray. Steps Grow cells or isolate tissue (brain, liver, muscle) Isolate total RNA Isolate mRNA from total RNA (poly.
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
Bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing Jorge Duitama 1, Ion Mandoiu 1, and Pramod Srivastava.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Mining SNPs from EST Databases Picoult-Newberg et al. (1999)
Bioinformatics Tools for Personalized Cancer Immunotherapy
RNA-Seq An alternative to microarray. Steps Grow cells or isolate tissue (brain, liver, muscle) Isolate total RNA Isolate mRNA from total RNA (poly.
Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.
Genotyping of James Watson’s genome from Low-coverage Sequencing Data Sanjiv Dinakar and Yözen Hernández.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data Jorge Duitama 1, Pramod Srivastava 2, and Ion.
High Throughput Sequencing
mRNA-Seq: methods and applications
Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul.
Delon Toh. Pitfalls of 2 nd Gen Amplification of cDNA – Artifacts – Biased coverage Short reads – Medium ~100bp for Illumina – 700bp for 454.
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
Introduction to Next-Generation Sequencing
Next generation sequencing Xusheng Wang 4/29/2010.
Li and Dewey BMC Bioinformatics 2011, 12:323
Todd J. Treangen, Steven L. Salzberg
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
SIGNAL PROCESSING FOR NEXT-GEN SEQUENCING DATA
Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,
Next Generation Sequencing and its data analysis challenges Background Alignment and Assembly Applications Genome Epigenome Transcriptome.
TopHat Mi-kyoung Seo. Today’s paper..TopHat Cole Trapnell at the University of Washington's Department of Genome Sciences Steven Salzberg Center.
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
The iPlant Collaborative
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Novel transcript reconstruction from ION Torrent sequencing reads and Viral Meta-genome Reconstruction from AmpliSeq Ion Torrent data University of Connecticut.
Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.
Sahar Al Seesi and Ion Măndoiu Computer Science and Engineering
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Introduction to RNAseq
RNA-seq: Quantifying the Transcriptome
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
The iPlant Collaborative
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Adrian Caciula (GSU), Serghei Mangul (UCLA) James Lindsay, Ion.
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
From Reads to Results Exome-seq analysis at CCBR
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
GCC Workshop 9 RNA-Seq with Galaxy
WS9: RNA-Seq Analysis with Galaxy (non-model organism )
VCF format: variants c.f. S. Brown NYU
RNA-Seq analysis in R (Bioconductor)
Reference based assembly
Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs BMI/CS Spring 2019 Colin Dewey
Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi
Sequence Analysis - RNA-Seq 2
Presentation transcript:

Bioinformatics Pipelines for RNA-Seq Data Analysis Ion Măndoiu and Sahar Al Seesi Computer Science & Engineering Department University of Connecticut BIBM 2011 Tutorial

Outline Background RNA-Seq read mapping Variant detection and genotyping from RNA-Seq reads Transcriptome quantification using RNA-Seq Implementing RNA-Seq analysis pipelines using Galaxy Novel transcript reconstruction Conclusions

Outline Background RNA-Seq read mapping NGS technologies RNA-Seq read mapping Variant detection and genotyping from RNA-Seq reads Transcriptome quantification using RNA-Seq Implementing RNA-Seq analysis pipelines using Galaxy Novel transcript reconstruction Conclusions

Cost of DNA Sequencing http://www.economist.com/node/16349358 4

2nd Gen. Sequencing: Illumina -SBS: Sequencing by Synthesis -SBL: Sequencing by Ligation -Challenges in Genome Assembly: The short read lengths and absence of paired ends make it difficult for assembly software to disambiguate repeat regions, therefore resulting in fragmented assemblies. -New Type of sequencing error: in 454 including incorrect estimates of homopolymer lengths, ‘transposition-like’ insertions (a base identical to a nearby homopolymer is inserted in a nearby nonadjacent location) and errors caused by multiple templates attached to the same bead 5

2nd Gen. Sequencing: Illumina -SBS: Sequencing by Synthesis -SBL: Sequencing by Ligation -Challenges in Genome Assembly: The short read lengths and absence of paired ends make it difficult for assembly software to disambiguate repeat regions, therefore resulting in fragmented assemblies. -New Type of sequencing error: in 454 including incorrect estimates of homopolymer lengths, ‘transposition-like’ insertions (a base identical to a nearby homopolymer is inserted in a nearby nonadjacent location) and errors caused by multiple templates attached to the same bead 6

2nd Gen. Sequencing: SOLiD Emulsion PCR used to perform single molecule amplification of pooled library onto magnetic beads

2nd Gen. Sequencing: SOLiD

2nd Gen. Sequencing: SOLiD

2nd Gen. Sequencing: SOLiD

2nd Gen. Sequencing: SOLiD

2nd Gen. Sequencing: 454

2nd Gen. Sequencing: Ion Torrent PGM High-density array of micro-machined wells Each well holds a different clonally amplified DNA template generated by emulsion PCR Beneath the wells is an ion-sensitive layer and beneath that a proprietary Ion sensor The sequencer sequentially floods the chip with one nucleotide after another (natural nucleotides) If currently flooded nucleotide complements next base on template, a voltage change is recorded

3nd Gen. Sequencing PacBio SMRT Nanopore sequencing Gamma labelled dNTPs release label after incorporation. Product is not labelled; detect the base Key is small detection volume ( illuminated with laser beam) Signal from dNTP held in place while it is being incorporated (“tens of milliseconds”) Background from unincorporated dNTPs diffusing in and out of detection volume (“a few microseconds”) Nanopore sequencing

Cost/Performance Comparison [Glenn 2011]

A transformative technology Re-sequencing De novo genome sequencing RNA-Seq Non-coding RNAs Structural variation ChIP-Seq Methyl-Seq Metagenomics Viral quasispecies Shape-Seq … many more biological measurements “reduced” to NGS sequencing 16

Outline Background RNA-Seq read mapping Mapping strategies Merging read alignments Variant detection and genotyping from RNA-Seq reads Transcriptome quantification using RNA-Seq Implementing RNA-Seq analysis pipelines using Galaxy Novel transcript reconstruction Conclusions

Mapping RNA-Seq Reads http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png 18

Mapping Strategies for RNA-Seq reads Short reads (Illumina, SOLiD) Ungapped mapping (with mismatches) on genome Leverages existing tools: bowtie, BWA, … Cannot align reads spanning exon-junctions Mapping on transcript libraries Cannot align reads from un-annotated transcripts Mapping on exon-exon junction libraries Cannot align reads overlapping un-annotated exons Spliced alignment on the genome Similar to classic EST alignment problem, but harder due to short read length and large number of reads Tools: QPLAMA [De Bona et al. 2008], Tophat [Trapnell et al. 2009], MapSplice [Wang et al. 2010] Hybrid approaches Long read mapping (454, ION Torrent) Local alignment (Smith-Waterman) to the genome Handles indel errors characteristic of current long read technologies

Spliced Read Alignment with Tophat C. Trapnell, L. Pachter, and S.L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–1111, 2009.

Hybrid Approach Based on Merging Alignments mRNA reads Transcript Library Mapping Genome Mapping Read Merging Transcript mapped reads Genome mapped reads Mapped reads

Alignment Merging for Short Reads Genome Transcripts Agree? Hard Merge Soft Merge Unique Yes Keep No Throw Multiple Not Mapped Not mapped

Merging Of Local Alignments

Outline Background RNA-Seq read mapping Variant detection and genotyping from RNA-Seq reads SNVQ algorithm Experimental results Transcriptome quantification using RNA-Seq Implementing RNA-Seq analysis pipelines using Galaxy Novel transcript reconstruction Conclusions

Motivation RNA-Seq is much less expensive than genome sequencing Can sequence variants be discovered reliably from RNA-Seq data? SNVQ: novel Bayesian model for SNV discovery and genotyping from RNA-Seq data [Duitama et al., ICCABS 2011 ] Particularly appropriate when interest is in expressed mutations (cancer immunotherapy)

SNP Calling from Genomic DNA Reads Read sequences & quality scores Reference genome sequence @HWI-EAS299_2:2:1:1536:631 GGGATGTCAGGATTCACAATGACAGTGCTGGATGAG +HWI-EAS299_2:2:1:1536:631 ::::::::::::::::::::::::::::::222220 @HWI-EAS299_2:2:1:771:94 ATTACACCACCTTCAGCCCAGGTGGTTGGAGTACTC +HWI-EAS299_2:2:1:771:94 :::::::::::::::::::::::::::2::222220 >ref|NT_082868.6|Mm19_82865_37:1-3688105 Mus musculus chromosome 19 genomic contig, strain C57BL/6J GATCATACTCCTCATGCTGGACATTCTGGTTCCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCA ACTTTCCCTCTTCCTATGCCAGTTATGCATAATGCACAAATATTTCCACGCTTTTTCACTACAGATAAAG AACTGGGACTTGCTTATTTACCTTTAGATGAACAGATTCAGGCTCTGCAAGAAAATAGAATTTTCTTCAT ACAGGGAAGCCTGTGCTTTGTACTAATTTCTTCATTACAAGATAAGAGTCAATGCATATCCTTGTATAAT Read Mapping SNP calling 1 4764558 G T 2 1 1 4767621 C A 2 1 1 4767623 T A 2 1 1 4767633 T A 2 1 1 4767643 A C 4 2 1 4767656 T C 7 1 26

SNV Detection and Genotyping Locus i Reference AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC Ri r(i) : Base call of read r at locus i εr(i) : Probability of error reading base call r(i) Gi : Genotype at locus i

SNV Detection and Genotyping Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one

Current Models Maq: SOAPsnp Keep just the alleles with the two largest counts Pr (Ri | Gi=HiHi) is the probability of observing k alleles r(i) different than Hi Pr (Ri | Gi=HiH’i) is approximated as a binomial with p=0.5 SOAPsnp Pr (ri | Gi=HiH’i) is the average of Pr(ri|Hi) and Pr(ri|Gi=H’i) A rank test on the quality scores of the allele calls is used to confirm heterozygocity

SNVQ Model Calculate conditional probabilities by multiplying contributions of individual reads

Experimental Setup 113 million 32bp Illumina mRNA reads generated from blood cell tissue of Hapmap individual NA12878 (NCBI SRA database accession numbers SRX000565 and SRX000566) We tested genotype calling using as gold standard 3.4 million SNPs with known genotypes for NA12878 available in the database of the Hapmap project True positive: called variant for which Hapmap genotype coincides False positive: called variant for which Hapmap genotype does not coincide

Comparison of Mapping Strategies

Comparison of Variant Calling Strategies

Data Filtering

Data Filtering Allow just x reads per start locus to eliminate PCR amplification artifacts [Chepelev et al. 2010] algorithm: For each locus groups starting reads with 0, 1 and 2 mismatches Choose at random one read of each group

Comparison of Data Filtering Strategies

Accuracy per RPKM bins

Outline Background RNA-Seq read mapping Variant detection and genotyping from RNA-Seq reads Transcriptome quantification using RNA-Seq IsoEM algo Experimental results Alternative protocols and inference problems DGE protocol Inference of allele specific expression levels Implementing RNA-Seq analysis pipelines using Galaxy Novel transcript reconstruction Conclusions

Alternative splicing [Griffith and Marra 07]

Pal S. et all , Genome Research, June 2011 Alternative Splicing Pal S. et all , Genome Research, June 2011

Computational Problems Make cDNA & shatter into fragments Sequence fragment ends A B C D E Map reads Gene Expression (GE) Isoform Expression (IE) A B C D E Isoform Discovery (ID)

Challenges to accurate estimation of gene expression levels Read ambiguity (multireads) What is the gene length? A B C D E

Previous Approaches to GE Ignore multireads [Mortazavi et al. 08] Fractionally allocate multireads based on unique read estimates [Pasaniuc et al. 10] EM algorithm for solving ambiguities Gene length: sum of lengths of exons that appear in at least one isoform  Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Read Ambiguity in IE A B C D E

Previous Approaches to IE [Jiang&Wong 09] Poisson model + importance sampling, single reads [Richard et al. 10] EM Algorithm based on Poisson model, single reads in exons [Li et al. 10] EM Algorithm, single reads [Feng et al. 10] Convex quadratic program, pairs used only for ID [Trapnell et al. 10] Extends Jiang’s model to paired reads Fragment length distribution

IsoEM algorithm [Nicolae et al. 2011] Unified probabilistic model and Expectation-Maximization Algorithm (IsoEM) for IE considering Single and/or paired reads Fragment length distribution Strand information Base quality scores Repeat and hexamer bias correction

Read-isoform compatibility

Fragment length distribution Paired reads Fa(i) Fa (j) i A C B A B C A B C j

Fragment length distribution Single reads Fa(i) Fa (j) i A B C A B C A B C j

IsoEM pseudocode E-step M-step

Implementation details Collapse identical reads into read classes (i3,i4) (i1,i2) (i3,i4) (i3,i5) Reads LCA(i3,i4) Isoforms i1 i2 i3 i4 i5 i6

Implementation details Run EM on connected components, in parallel i2 i4 Isoforms i1 i3 i5 i6

Simulation setup Human genome UCSC known isoforms GNFAtlas2 gene expression levels Uniform/geometric expression of gene isoforms Normally distributed fragment lengths Mean 250, std. dev. 25

Accuracy measures Error Fraction (EFt) Median Percent Error (MPE) r2 Percentage of isoforms (or genes) with relative error larger than given threshold t Median Percent Error (MPE) Threshold t for which EF is 50% r2

Error fraction curves - isoforms 30M single reads of length 25 (simulated)

Error fraction curves - genes 30M single reads of length 25 (simulated)

MPE and EF15 by gene expression level 30M single reads of length 25

Read length effect on IE MPE Fixed sequencing throughput (750Mb) Single Reads Paired Reads

Read length effect on IE r2 Fixed sequencing throughput (750Mb)

Effect of pairs & strand information 75bp reads

Runtime scalability Scalability experiments conducted on a Dell PowerEdge R900 Four 6-core E7450Xeon processors at 2.4Ghz, 128Gb of internal memory

MAQC data RNA samples: UHRR, HBRR qPCR 6 libraries, 47-92M 35bp reads each [Bullard et al. 10] Bases called using both auto and phi X calibration for 2 libraries qPCR Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]

MPE comparison for MAQC samples

r2 comparison for MAQC samples

R2 of IsoEM estimates from ION Torrent & Illumina HBR reads Average R2 for 5 ION Torrent MAQC HBR Runs (avg. 1,559,842 reads) R2 for combined reads from 5 ION Torrent MAQC HBR Runs (7,799,210 reads)

DGE/SAGE-Seq protocol AAAAA Cleave with anchoring enzyme (AE) AAAAA CATG AE TCCRAC AAAAA CATG AE TE Attach primer for tagging enzyme (TE) Cleave with tagging enzyme CATG Map tags A B C D E Gene Expression (GE)

Inference algorithms for DGE data Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10] Heuristic rescue of some ambiguous tags [Wu et al. 10] DGE-EM algorithm [Nicolae & Mandoiu, ISBRA 2011] Uses all tags, including all ambiguous ones Uses quality scores Takes into account partial digest and gene isoforms

Tag formation probability

Tag-isoform compatibility

DGE-EM algorithm assign random values to all f(i) while not converged E-step init all n(i,j) to 0 for each tag t for (i,j,w) in t M-step for each isoform i

MAQC data DGE RNA-Seq qPCR 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09] Anchoring enzyme DpnII (GATC) RNA-Seq 6 libraries, 47-92M 35bp reads each [Bullard et al. 10] qPCR Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]

DGE-EM vs. Uniq on HBRR Library 4

DGE vs. RNA-Seq

DGE vs. RNA-Seq

Synthetic data 1-30M tags, lengths 14-26bp UCSC hg19 genome and known isoforms Simulated expression levels Gene expression for 5 tissues from the GNFAtlas2 Geometric expression for the isoforms of each gene Anchoring enzymes from REBASE DpnII (GATC) [Asmann et al. 09] NlaIII (CATG) [Wu et al. 10] CviJI (RGCY, R=G or A, Y=C or T) 75

Anchoring enzyme statistics

MPE for 30M 21bp tags RNA-Seq: 8.3 MPE

DGE vs. RNA-Seq Summary RNA-Seq and DGE based estimates have comparable cost-normalized accuracy on MAQC data When using best inference algorithm for each type of data Simulations suggest possible DGE protocol improvements Enzymes with degenerate recognition sites (e.g. CviJI) Optimizing cutting probability

Allele Specific Expression in F1 Hybrids [McManus et al. 10] 26M 42M 31M 78M Paired-end reads (37bp)

Analysis Pipeline for Allele-Specific Isoform Expression in F1 Hybrids

Allele Specific Expression on Drosophila RNA-Seq data from [McManus et al. 10]

Allele Specific Expression for Mouse RNA-Seq Data from [Gregg et al

General Pipeline for Allele-Specific Isoform Expression

Outline Background RNA-Seq read mapping Variant detection and genotyping from RNA-Seq reads Transcriptome quantification using RNA-Seq Implementing RNA-Seq analysis pipelines using Galaxy Running analyses, creating flows, and adding tools in Galaxy Hands on exercise Novel transcript reconstruction Conclusions

Galaxy Web-based platform for bioinformatics analysis Aims to facilitate reproducing results Provides user friendly interface to many available tools Free public server (maintained by PSU) Downloadable galaxy instance for installation and expansion (adding tools)

Local Galaxy Instance http://rna1.engr.uconn.edu:7474/ Lab Tools NGS: IsoEM SNVQ Tools available on the PSU server

Adding Tools to Local Galaxy Instances Galaxy Wiki for tool configuration syntax http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax

Outline Background RNA-Seq read mapping Variant detection and genotyping from RNA-Seq reads Transcriptome quantification using RNA-Seq Implementing RNA-Seq analysis pipelines using Galaxy Novel transcript reconstruction Overview of existing approaches DRUT algorithm Conclusions

Existing approaches Genome-guided reconstruction (ab initio) Exon identification Genome-guided assembly Genome independent (de novo) reconstruction Genome-independent assembly Annotation-guided reconstruction Explicitly use existing annotation during assembly

Genome-guided reconstruction Scripture (2010), IsoLasso (2011) Reports “all” isoforms Cufflinks (2010) Reports a minimal set of isoforms Trapnell, M. et al May 2010, Guttman, M. et al May 2010

Genome independent reconstruction Trinity (2011),Velvet (2008), TransABySS (2008) Euler/de Brujin k-mer graph Grabherr, M. et al. Nat. Biotechnol. July 2011

Garber, M. et al. Nat. Biotechnol. June 2011 GGR vs GIR (a) Reads originating from two different isoforms of the same genes are colored black and gray. In genome-guided assembly, reads are first mapped to a reference genome, and spliced reads are used to build a transcript graph, which is then parsed into gene annotations. In the genome-independent approach, reads are broken into k-mer seeds and arranged into a de Bruijn graph structure. The graph is parsed to identify transcript sequences, which are aligned to the genome to produce gene annotations. Garber, M. et al. Nat. Biotechnol. June 2011

Garber, M. et al. Nat. Biotechnol. June 2011 Max Set vs Min Set Spliced reads give rise to four possible transcripts, but only two transcripts are needed to explain all reads; the two possible sets of minimal isoforms are depicted. Garber, M. et al. Nat. Biotechnol. June 2011

Cufflinks G(V,E) V – pe reads E – compatible reads Figure 6. Compatibility and incompatibility of fragments. End-reads are solid lines, unknown sequences within fragments are shown by dotted lines and implied introns are dashed lines. The reads in (a) are compatible, whereas the fragments in (b) are incompatible. The fragments in (c) are nested. Fragment x4 in (d) is uncertain, because y4 and y5 are incompatible with each other Fragment x4 in (d) is uncertain, because y4 and y5 are incompatible with each other

Scripture Connectivity graph Filter isoforms V – bases E – spliced event Filter isoforms Coverage (p-value) Insert length

Other statistical assembly IsoLasso multivariate regression method – Lasso balance between the maximization of prediction accuracy and the minimization of interpretation SLIDE sparse estimation as a modified Lasso limiting the number of discovered isoforms and favoring longer isoforms Li, W. et al. RECOMB 2011, Li J et all Proc Natl Acad Sci. USA 2011

Reconstruction Strategies Comparison Grabherr, M. et al. Nat. Biotechnol. May 2011

Detection and Reconstruction of Unannotated Transcripts a) Map reads to annotated transcripts (using Bowtie) b) VTEM: Identify overexpressed exons (possibly from unannotated transcripts) c) Assemble Transcripts (e.g., Cufflinks) using reads from “overexpressed” exons and unmapped reads d) Output: annotated transcripts + novel transcripts

Virtual Transcript Expectation Maximization (VTEM) VTEM is based on a modification of Virtual String Expectation Maximization (VSEM) Algorithm [Mangul et al. 2011]). the difference is that we consider in the panel exons instead of reads Calculate observed exon counts based on read mapping each read contribute to count of either one exon or two exons (depending if it is a unspliced spliced read or spliced read) R1 R2 R4 reads R3 transcripts T1 T2 R1 R2 R4 reads R3 exon counts transcripts T1 T2 E1 E2 exons E3 - How? We can see that R1, R2, 3 and 4 were emitted by Transcript 1 and also T3 also emitts 2,3,4. but each tr. consists of several exons and tr. shares exons -> for a more accurate estimation we will consider exons Transcripts consists of exons and reads can be emitted from a single exon or from two exons if they are spliced reads. 1 3

Input: Complete vs Partial Annotations Complete Annotations Partial Annotations O 0.25 exons O 0.25 exons E1 E1 transcripts transcripts T1 T1 E2 E2 T2 T2 E3 E3 T3 E4 E4 Observed exon frequencies Transcript T3 is missing from annotations

Virtual Transcript Expectation Maximization (VTEM) ML estimates of transcript frequencies Compute expected exons Update weights of reads in virtual transcript EM (Partially) Annotated Genome + Virtual Transcript with 0-weights in virtual transcript Virtual Transcript frequency change>ε? Output overexpressed exons (expressed by virtual transcripts) YES NO Use reads from overexpressed exons as input in Cufflinks for reconstruction of unknown transcripts. Overexpressed exons belong to unknown transcripts

Reads simulated from UCSC known genes Simulation Setup Reads simulated from UCSC known genes 19, 372 genes 66, 803 isoforms Single end, error-free 60M reads of length 100bp To simulate incomplete annotation, remove from every gene exactly one isoform

Comparison Between Methods Sensitivity - portion of the annotated transcript sequences being captured by candidate transcript sequences

Outline Background RNA-Seq read mapping Variant detection and genotyping from RNA-Seq reads Transcriptome quantification using RNA-Seq Implementing RNA-Seq analysis pipelines using Galaxy Novel transcript reconstruction Conclusions

Conclusions The range of NGS applications continues to expand, fueled by advances in technology Improved sample prep protocols 3rd generation: Pacific Biosciences, Ion Torrent Development of sophisticated analysis methods remains critical for fully realizing the potential of sequencing technologies Comparison with alternative protocol for measuring gene expr called digital gene expr The techniques behind isoem are applicable to reconstruction & freq est for virus quasispecies

Further readings Read mapping Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25 Heng Li andRichard Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics (2009) 25(14): 1754-1760 Kurtz S, Sharma CM, Khaitovich P, Vogel J., Stadler, PF, Hoffmann S, Otto C, and Hackermuller J. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol, 5(9):e1000502, 2009 Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25(9):1105–1111

Further readings SNV discovery and genotyping H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(1):1851–1858, 2008. R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang. SNP detection for massively parallel whole-genome resequencing. Genome Research, 19:1124–1132, 2009. I. Chepelev, G. Wei, Q. Tang, and K. Zhao. Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Research, 37(16):e106, 2009. J. Duitama and J. Kennedy and S. Dinakar and Y. Hernandez and Y. Wu and I.I. Mandoiu, Linkage Disequilibrium Based Genotype Calling from Low-Coverage Shotgun Sequencing Reads, BMC Bioinformatics 12(Suppl 1):S53, 2011. J. Duitama and P.K. Srivastava and I.I. Mandoiu, Towards Accurate Detection and Genotyping of Expressed Variants fromWhole Transcriptome Sequencing Data, Proc. 1st IEEE International Conference on Computational Advances in Bio and Medical Sciences, pp. 87-92, 2011. S.Q. Le and R. Durbin: SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome research, to appear.

Further readings Estimation of gene expression levels from RNA-Seq data Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 2008, 5(7):621–628. Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 2009, 25(8):1026–1032. Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 2010, 26(4):493–500. Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology 2010, 28(5):511–515. M. Nicolae, S. Mangul, I.I. Mandoiu, and A. Zelikovsky. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology, 6:9, 2011. Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology 2011, 12(3):R22.

Further readings Estimation of gene expression levels from DGE data Y. Asmann, E.W. Klee, E.A. Thompson, E. Perez, S. Middha, A. Oberg, T. Therneau, D. Smith, G. Poland, E. Wieben, and J.-P. Kocher. 3’ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics, 10(1):531, 2009. Z.J. Wu, C.A. Meyer, S. Choudhury, M. Shipitsin, R. Maruyama, M. Bessarabova, T. Nikolskaya, S. Sukumar, A. Schwartzman, J.S. Liu, K. Polyak, and X.S. Liu. Gene expression profiling of human breast tissue samples using SAGE-Seq. Genome Research, 20(12):1730–1739, 2010. M. Nicolae and I.I. Mandoiu. Accurate estimation of gene expression levels from DGE sequencing data. In Proc. 7th International Symposium on Bioinformatics Research and Applications, pp. 392-403, 2011.

Further readings Novel transcript reconstruction Manuel Garber, Manfred G Grabherr, Mitchell Guttman, and Cole Trapnell, Computational methods for transcriptome annotation and quantification using RNA-seq, Nature Methods 8, 469-477, 2011 Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology 2010, 28(5):511–515. Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander, and Aviv Regev. Ab initio reconstruction of cell type specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology 28, 503-510, 2010 S. Mangul, A. Caciula, I. Mandoiu, and A. Zelikovsky. RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes, Proc. BIBM 2011, pp.118-123

Software packages SNV detection and genotyping from RNA-Seq reads: http://dna.engr.uconn.edu/software/NGSTools Inference of gene expression levelsFrom RNA-Seq reads: http://dna.engr.uconn.edu/software/IsoEM/ Inference of gene expression levels from DGE reads: http://www.dna.engr.uconn.edu/software/DGE-EM

Galaxy References Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21. UCONN Galaxy instance: http://rna1.engr.uconn.edu:7474/ Main Galaxy server at PSU: http://galaxy.psu.edu/

Acknowledgments Alex Zelikovsky (GSU) Jorge Duitama (KU Leuven) Serghei Mangul (GSU) Adrian Caciula (GSU) Dumitru Brinza (Life Tech) Jorge Duitama (KU Leuven) Marius Nicolae (Uconn) Pramod Srivastava (UCHC)