Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data. Alex Zelikovsky, Department of Computer Science, Georgia State University.

Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data
Alex Zelikovsky, Department of Computer Science, Georgia State University
Joint work with Serghei Mangul, Sahar Al Seesi, Adrian Caciula, Dumitru Brinza, Ion Mandoiu

Advances in Next Generation Sequencing
– Roche/454 FLX Titanium: million reads/run, 400bp avg. length
– Illumina HiSeq 2000: up to 6 billion PE reads/run, bp read length
– SOLiD 4/: billion PE reads/run, 35-50bp read length
– Ion Proton Sequencer

RNA-Seq
– Make cDNA & shatter into fragments
– Sequence fragment ends
– Map reads
– Downstream analyses: Gene Expression, Transcriptome Reconstruction, Isoform Expression
[Figure: gene with exons A, B, C, D, E and isoforms ABC, AC, DE]

Transcriptome Assembly
Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

Transcriptome Assembly Types
– Genome-independent reconstruction (de novo)
  – de Bruijn k-mer graph
– Genome-guided reconstruction (ab initio)
  – Spliced read mapping
  – Exon identification
  – Splice graph
– Annotation-guided reconstruction
  – Use existing annotation (known transcripts)
  – Focus on discovering novel transcripts

Previous approaches
– Genome-independent reconstruction
  – Trinity (2011), Velvet (2008), TransABySS (2008)
– Genome-guided reconstruction
  – Scripture (2010): reports "all" transcripts
  – Cufflinks (2010), IsoLasso (2011), SLIDE (2012), CLIIQ (2012), TRIP (2012), Traph (2013): minimize the set of transcripts explaining reads
– Annotation-guided reconstruction
  – RABT (2011), DRUT (2011)

Gene representation
– Pseudo-exons: regions of a gene between consecutive transcriptional or splicing events
– Gene: set of non-overlapping pseudo-exons
[Figure: transcripts Tr1, Tr2, Tr3 built from exons e1 to e6, split into pseudo-exons pse1 to pse7, each with a start (S) and end (E) boundary]
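The pseudo-exon definition above can be illustrated with a small sketch. It assumes each transcript is given as a list of half-open (start, end) exon intervals; the function name `pseudo_exons` and the coordinates are made up for illustration and are not taken from the slides.

```python
def pseudo_exons(transcripts):
    """Split a gene into pseudo-exons.

    transcripts: list of transcripts, each a list of (start, end) exon
    intervals with half-open coordinates [start, end). Hypothetical input
    format chosen for this sketch.
    """
    # Every exon start or end marks a transcriptional or splicing event.
    boundaries = sorted({b for tr in transcripts for exon in tr for b in exon})
    exons = [exon for tr in transcripts for exon in tr]
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a segment only if at least one exon covers it;
        # otherwise it is intronic in every transcript.
        if any(s <= left and right <= e for s, e in exons):
            result.append((left, right))
    return result


# Toy gene with three transcripts (coordinates are invented).
tr1 = [(0, 100), (200, 300), (400, 500)]
tr2 = [(0, 100), (250, 300), (400, 500)]
tr3 = [(0, 150), (400, 500)]
gene = pseudo_exons([tr1, tr2, tr3])
print(gene)
```

Each returned interval lies between two consecutive boundaries, so the pseudo-exons never overlap, which matches the "set of non-overlapping pseudo-exons" view of a gene.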

Splice Graph
[Figure: splice graph drawn along the genome; labels: TSS, pseudo-exons, TES]

MaLTA: Maximum Likelihood Transcriptome Assembly
– Map the RNA-Seq reads to genome
– Construct Splice Graph G(V,E)
  – V: exons
  – E: splicing events
– Candidate transcripts
  – depth-first search (DFS)
– Select candidate transcripts
  – IsoEM
  – greedy algorithm
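The candidate-generation step (DFS over the splice graph) might look roughly like the sketch below. It assumes the splice graph is an acyclic adjacency dictionary and that start and end exons are known; the names `candidate_transcripts`, `sources` and `sinks` are illustrative and not from MaLTA itself.

```python
def candidate_transcripts(graph, sources, sinks):
    """Enumerate all source-to-sink paths in an acyclic splice graph by DFS.

    graph: dict mapping an exon to the exons it is spliced to.
    sources / sinks: exons where a transcript may start / end.
    """
    paths = []

    def dfs(node, path):
        path.append(node)
        if node in sinks:
            # A path ending at a sink is one candidate transcript.
            paths.append(tuple(path))
        for nxt in graph.get(node, ()):
            dfs(nxt, path)
        path.pop()  # backtrack

    for s in sources:
        dfs(s, [])
    return paths


# Toy splice graph: e2 is a skippable exon.
splice_graph = {"e1": ["e2", "e3"], "e2": ["e3"], "e3": ["e4"], "e4": []}
candidates = candidate_transcripts(splice_graph, sources=["e1"], sinks={"e4"})
print(candidates)
```

The number of paths can grow exponentially with the number of alternative splicing events, which is one reason a selection step is needed after enumeration.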

How to select?
– Select the smallest set of candidate transcripts covering all transcript variants
– Transcript: set of transcript variants
[Figure from Sharmistha Pal, Ravi Gupta, Hyunsoo Kim, et al., Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Res: alternative first exon, alternative last exon, exon skipping, intron retention, alternative 5' splice junction, splice junction]

IsoEM: Isoform Expression Level Estimation
– Expectation-Maximization algorithm
– Unified probabilistic model incorporating
  – Single and/or paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores
  – Repeat and hexamer bias correction
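To make the expectation-maximization idea concrete, here is a deliberately simplified sketch. It is not the IsoEM implementation: it assumes each read comes with precomputed compatibility weights per isoform, and it ignores isoform length normalization and all bias corrections. The function name and input format are hypothetical.

```python
def em_isoform_frequencies(compat, n_isoforms, iterations=200):
    """Estimate the fraction of reads originating from each isoform.

    compat: one dict per read, mapping isoform index -> compatibility
    weight (for example derived from fragment length and base qualities).
    """
    freqs = [1.0 / n_isoforms] * n_isoforms
    for _ in range(iterations):
        counts = [0.0] * n_isoforms
        # E-step: split each read among compatible isoforms in proportion
        # to current frequency times compatibility weight.
        for weights in compat:
            total = sum(freqs[i] * w for i, w in weights.items())
            if total == 0:
                continue  # read is incompatible with all isoforms
            for i, w in weights.items():
                counts[i] += freqs[i] * w / total
        # M-step: new frequencies are the normalized expected counts.
        assigned = sum(counts)
        if assigned == 0:
            break
        freqs = [c / assigned for c in counts]
    return freqs


# Toy data: 3 reads unique to isoform 0, 1 unique to isoform 1,
# and 4 reads equally compatible with both.
reads = [{0: 1.0}] * 3 + [{1: 1.0}] + [{0: 1.0, 1: 1.0}] * 4
freqs = em_isoform_frequencies(reads, n_isoforms=2)
print(freqs)
```

In this toy case the ambiguous reads are drawn toward the isoform with more unique evidence, so the estimate settles near 0.75 and 0.25.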

Read-isoform compatibility graph

Fragment length distribution
[Figure: the same reads mapped to isoforms ABC and AC imply different fragment lengths i and j, with probabilities F_a(i) and F_a(j)]
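One way to read the fragment length slide: the same read pair implies a different fragment length on each isoform, and F_a scores how plausible each length is. The sketch below assumes a normal distribution as a stand-in for F_a with made-up mean and standard deviation; the exon coordinates and helper names are also invented for illustration.

```python
import math


def fragment_length_prob(length, mean=300.0, sd=30.0):
    """Normal density used here as a stand-in for F_a (an assumption)."""
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def implied_fragment_length(isoform_exons, left_start, right_end):
    """Number of isoform bases between genomic positions [left_start, right_end)."""
    return sum(
        max(0, min(e, right_end) - max(s, left_start)) for s, e in isoform_exons
    )


# Toy gene: exons A, B, C with invented coordinates.
exon_a, exon_b, exon_c = (0, 200), (300, 400), (500, 700)
isoforms = {"ABC": [exon_a, exon_b, exon_c], "AC": [exon_a, exon_c]}

# A read pair whose outer ends fall at position 100 (in A) and 600 (in C).
lengths = {
    name: implied_fragment_length(exons, 100, 600)
    for name, exons in isoforms.items()
}
weights = {name: fragment_length_prob(length) for name, length in lengths.items()}
print(lengths, weights)
```

With these toy numbers the pair implies a longer fragment on ABC than on AC, and the isoform whose implied length is closer to the assumed mean receives the larger weight.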

Greedy algorithm
1. Sort transcripts by inferred IsoEM expression levels in decreasing order
2. Traverse transcripts
  – Select a transcript if it contains a novel transcript variant
  – Continue traversing until all transcript variants are covered
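The two greedy steps can be sketched as follows, assuming expression levels are already available as a dictionary and each candidate transcript is represented by the set of transcript variants it contains. The names and toy values are hypothetical.

```python
def greedy_select(expression, variants):
    """Select transcripts in decreasing expression order until all variants are covered.

    expression: dict transcript -> estimated expression level.
    variants: dict transcript -> set of transcript variants it contains.
    """
    all_variants = set().union(*variants.values())
    covered, selected = set(), []
    # Step 1: sort by expression level, highest first.
    for t in sorted(expression, key=expression.get, reverse=True):
        if covered == all_variants:
            break  # STOP: all transcript variants are covered
        # Step 2: keep the transcript only if it adds a novel variant.
        novel = variants[t] - covered
        if novel:
            selected.append(t)
            covered |= novel
    return selected


# Toy input (invented values).
expression = {"t1": 50.0, "t2": 30.0, "t3": 20.0, "t4": 5.0}
variants = {
    "t1": {"v1", "v2"},
    "t2": {"v2"},
    "t3": {"v2", "v3"},
    "t4": {"v1", "v3"},
}
selected = greedy_select(expression, variants)
print(selected)
```

Note that this ordering follows expression level rather than the number of newly covered variants, so it differs from the classic greedy set-cover heuristic.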

Greedy algorithm
[Figure: transcript variants against transcripts sorted by expression levels, animated over several slides]

Greedy algorithm
STOP. All transcript variants are covered.

MaLTA results on GOG-350 dataset
– 4.5M single Ion reads with average read length 121 bp, aligned using TopHat2
– Number of assembled transcripts
  – MaLTA:
  – Cufflinks:
– Number of transcripts matching annotations
  – MaLTA: 4555 (26%)
  – Cufflinks: 2031 (13%)

Expression Estimation on Ion Torrent reads
– Squared correlation
  – IsoEM / Cufflinks FPKMs vs qPCR values for 800 genes
  – 2 MAQC samples: Human Brain and Universal
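For reference, the squared correlation between estimated FPKMs and qPCR values can be computed as below. This is a generic squared Pearson correlation; whether the slides used raw or log-transformed values is not stated, and the toy numbers are invented, not results from the talk.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation (r^2) between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Invented example values for four genes.
fpkm = [1.0, 2.0, 3.0, 4.0]
qpcr = [2.0, 4.0, 6.0, 8.0]
r2 = squared_correlation(fpkm, qpcr)
print(r2)
```

A value near 1 means the estimated expression levels track the qPCR measurements closely, up to a linear rescaling.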

Conclusions
– Novel method for transcriptome assembly
– Validated on Ion Torrent RNA-Seq data
– Compared with Cufflinks:
  – similar number of assembled transcripts
  – 2x more previously annotated transcripts
– Transcript quantification is useful for transcript assembly: better quantification?