Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Sahar Al Seesi, Adrian Caciula, Dumitru Brinza, Ion Mandoiu
Advances in Next Generation Sequencing Roche/454 FLX Titanium: million reads/run, 400bp avg. length; Illumina HiSeq 2000: up to 6 billion PE reads/run, bp read length; SOLiD 4/: billion PE reads/run, 35-50bp read length; Ion Proton Sequencer
RNA-Seq Make cDNA & shatter into fragments; sequence fragment ends; map reads. Downstream analyses: Gene Expression, Transcriptome Reconstruction, Isoform Expression [Figure: exons A, B, C, D, E; transcripts ABC, AC, DE]
Transcriptome Assembly Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.
Transcriptome Assembly Types Genome-independent reconstruction (de novo) – de Bruijn k-mer graph Genome-guided reconstruction (ab initio) – Spliced read mapping – Exon identification – Splice graph Annotation-guided reconstruction – Use existing annotation (known transcripts) – Focus on discovering novel transcripts
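As an illustration of the de Bruijn k-mer graph used by genome-independent assemblers, here is a minimal sketch. It assumes error-free reads given as plain strings; the function name `de_bruijn_edges` and the toy reads are hypothetical and not taken from any of the tools discussed in this talk.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy reads (made up for illustration); they overlap and share all 3-mers.
graph = de_bruijn_edges(["ACGTAC", "CGTACG"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Real assemblers additionally handle sequencing errors, reverse complements, and coverage, which this sketch leaves out.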
Previous approaches Genome-independent reconstruction – Trinity (2011), Velvet (2008), TransABySS (2008) Genome-guided reconstruction – Scripture (2010): reports “all” transcripts – Cufflinks (2010), IsoLasso (2011), SLIDE (2012), CLIIQ (2012), TRIP (2012), Traph (2013): minimize the set of transcripts explaining the reads Annotation-guided reconstruction – RABT (2011), DRUT (2011)
Gene representation Pseudo-exons: regions of a gene between consecutive transcriptional or splicing events. Gene: set of non-overlapping pseudo-exons. [Figure: exons e1 to e6 of transcripts Tr1, Tr2, Tr3, split at the start (S) and end (E) coordinates of pseudo-exons pse1 to pse7]
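The pseudo-exon definition can be sketched as follows: every exon start or end is treated as an event boundary, and the gene is cut at all boundaries. This is a minimal illustration that assumes exons are half-open `(start, end)` intervals. The function name `pseudo_exons` and the coordinates are hypothetical.

```python
def pseudo_exons(transcripts):
    """Split a gene into non-overlapping pseudo-exons.

    transcripts: list of transcripts, each a list of (start, end) exons.
    """
    exons = [exon for tr in transcripts for exon in tr]
    # Every exon start or end marks a transcriptional or splicing event.
    boundaries = sorted({pos for exon in exons for pos in exon})
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Keep only segments that lie inside at least one exon (drop introns).
        if any(s <= start and end <= e for s, e in exons):
            segments.append((start, end))
    return segments

# Toy gene with three transcripts (coordinates made up for illustration).
tr1 = [(0, 100), (200, 300)]
tr2 = [(0, 100), (250, 300)]
tr3 = [(50, 100), (200, 300)]
print(pseudo_exons([tr1, tr2, tr3]))
```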
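Splice Graph [Figure: splice graph along the genome, from transcription start sites (TSS) through pseudo-exons to transcription end sites (TES)]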
MaLTA: Maximum Likelihood Transcriptome Assembly Map the RNA-Seq reads to the genome Construct splice graph G(V,E) – V: exons – E: splicing events Generate candidate transcripts – depth-first search (DFS) Select candidate transcripts – IsoEM – greedy algorithm
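Candidate generation by depth-first search can be sketched as enumerating every path from a start node to an end node of the splice graph. The sketch assumes the graph is acyclic (edges follow genome order) and that start and end nodes are given explicitly. The function name `enumerate_transcripts` and the toy graph are hypothetical, not MaLTA's actual code.

```python
def enumerate_transcripts(edges, starts, ends):
    """List all start-to-end paths in an acyclic splice graph via DFS."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    paths = []

    def dfs(node, path):
        path.append(node)
        if node in ends:
            paths.append(tuple(path))  # a complete candidate transcript
        for nxt in successors.get(node, []):
            dfs(nxt, path)
        path.pop()  # backtrack

    for s in starts:
        dfs(s, [])
    return paths

# Toy graph: exon 2 can be skipped via the splicing event 1 -> 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
candidates = enumerate_transcripts(edges, starts=[1], ends={4})
print(candidates)
```

The number of paths can grow exponentially with the number of alternative events, which is one reason a selection step is needed afterwards.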
How to select? Select the smallest set of candidate transcripts covering all transcript variants Transcript: set of transcript variants Sharmistha Pal, Ravi Gupta, Hyunsoo Kim, et al., Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Res: alternative first exon, alternative last exon, exon skipping, intron retention, alternative 5' splice junction, splice junction
IsoEM: Isoform Expression Level Estimation Expectation-Maximization algorithm Unified probabilistic model incorporating – Single and/or paired reads – Fragment length distribution – Strand information – Base quality scores – Repeat and hexamer bias correction
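To make the Expectation-Maximization idea concrete, here is a heavily simplified sketch. It estimates only the fraction of reads originating from each isoform, and assumes each read comes with precomputed compatibility weights; isoform-length normalization, base qualities, strand, and bias correction are all omitted. The function name `em_isoform_fractions` and the toy reads are hypothetical and do not reproduce IsoEM's implementation.

```python
def em_isoform_fractions(read_weights, isoforms, iterations=100):
    """Estimate the fraction of reads from each isoform by EM.

    read_weights: one dict per read, mapping compatible isoform -> weight.
    """
    frac = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(iterations):
        counts = {iso: 0.0 for iso in isoforms}
        # E-step: split each read among its compatible isoforms.
        for weights in read_weights:
            total = sum(frac[iso] * w for iso, w in weights.items())
            if total == 0:
                continue
            for iso, w in weights.items():
                counts[iso] += frac[iso] * w / total
        # M-step: re-estimate fractions from the expected counts.
        assigned = sum(counts.values())
        if assigned == 0:
            break
        frac = {iso: c / assigned for iso, c in counts.items()}
    return frac

# Toy data: 3 reads unique to ABC, 1 unique to AC, 6 compatible with both.
reads = [{"ABC": 1.0}] * 3 + [{"AC": 1.0}] + [{"ABC": 1.0, "AC": 1.0}] * 6
frac = em_isoform_fractions(reads, ["ABC", "AC"])
print(frac)
```

In this toy case the ambiguous reads end up split in the same 3:1 ratio as the unique reads, so the estimate approaches 0.75 for ABC.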
Read-isoform compatibility graph
Fragment length distribution [Figure: isoforms ABC and AC; fragment lengths i and j with probabilities F_a(i) and F_a(j)]
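The role of the fragment length distribution can be illustrated with a read pair whose ends fall in exons A and C: the implied fragment is longer on isoform ABC than on AC, so the two isoforms receive different weights. The sketch below assumes a normal fragment length distribution with mean 150 and standard deviation 20. These parameters, the exon lengths, and all function names are hypothetical.

```python
import math

def transcript_coord(isoform, exon_len, exon, offset):
    """Position within the spliced isoform of the base at `offset` in `exon`."""
    upstream = isoform[:isoform.index(exon)]
    return sum(exon_len[e] for e in upstream) + offset

def fragment_length(isoform, exon_len, left, right):
    """Fragment length implied by its first (left) and last (right) base."""
    return (transcript_coord(isoform, exon_len, *right)
            - transcript_coord(isoform, exon_len, *left) + 1)

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

exon_len = {"A": 100, "B": 150, "C": 100}   # made-up exon lengths
left, right = ("A", 20), ("C", 69)          # fragment ends as (exon, offset)
isoforms = {"ABC": ["A", "B", "C"], "AC": ["A", "C"]}

lengths = {name: fragment_length(exons, exon_len, left, right)
           for name, exons in isoforms.items()}
weights = {name: normal_pdf(length, mean=150, sd=20)
           for name, length in lengths.items()}
print(lengths, weights)
```

Here the pair implies a fragment of 300 bases on ABC but 150 on AC, so under the assumed distribution the pair is far more likely to come from AC.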
Greedy algorithm 1. Sort transcripts by inferred IsoEM expression levels in decreasing order 2. Traverse transcripts – Select a transcript if it contains a novel transcript variant – Continue traversing until all transcript variants are covered
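The two steps above can be sketched as follows, assuming each candidate transcript is represented by the set of transcript variants it contains and that expression levels have already been estimated. The function name `greedy_select` and the toy values are hypothetical.

```python
def greedy_select(candidates, expression):
    """Pick transcripts in decreasing expression order until all variants are covered.

    candidates: dict transcript -> set of transcript variants it contains.
    expression: dict transcript -> estimated expression level.
    """
    uncovered = set().union(*candidates.values())
    selected = []
    # Step 1: sort by expression level, highest first.
    for tr in sorted(candidates, key=lambda t: expression[t], reverse=True):
        if not uncovered:
            break  # all transcript variants are covered
        # Step 2: keep the transcript only if it covers a new variant.
        if candidates[tr] & uncovered:
            selected.append(tr)
            uncovered -= candidates[tr]
    return selected

# Toy input (made up for illustration).
candidates = {"T1": {"v1", "v2"}, "T2": {"v2"}, "T3": {"v3"}, "T4": {"v1", "v3"}}
expression = {"T1": 50.0, "T2": 30.0, "T3": 10.0, "T4": 5.0}
print(greedy_select(candidates, expression))
```

T2 is skipped because its only variant is already covered by T1, and the traversal stops before reaching T4.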
Greedy algorithm [Figure: transcript variants and transcripts sorted by expression levels]
Greedy algorithm Transcript Variants: Transcripts sorted by expression levels STOP. All transcript variants are covered.
MaLTA results on GOG-350 dataset 4.5M single Ion reads with average read length 121 bp, aligned using TopHat2 Number of assembled transcripts – MaLTA : – Cufflinks : Number of transcripts matching annotations – MaLTA : 4555 (26%) – Cufflinks : 2031 (13%)
Expression Estimation on Ion Torrent reads Squared correlation – IsoEM / Cufflinks FPKMs vs qPCR values for 800 genes – 2 MAQC samples : Human Brain and Universal
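For reference, the squared correlation between estimated expression values and qPCR measurements can be computed as the square of the Pearson coefficient. The sketch below uses made-up numbers, not the MAQC data, and the function name `squared_pearson` is hypothetical.

```python
def squared_pearson(x, y):
    """Squared Pearson correlation (r^2) between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Toy values standing in for per-gene FPKM estimates and qPCR measurements.
fpkm = [1.0, 2.0, 3.0]
qpcr = [1.0, 3.0, 2.0]
r2 = squared_pearson(fpkm, qpcr)
print(r2)
```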
Conclusions Novel method for transcriptome assembly Validated on Ion Torrent RNA-Seq data Compared with Cufflinks: – similar number of assembled transcripts – 2x more previously annotated transcripts Transcript quantification is useful for transcript assembly: better quantification?