Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected, 0 otherwise – x(p) - 1 if the pe read p is selected to be mapped within 1 std. dev. – n( s 1 ) - expected portion of reads mapped within 1 std. dev. Objective: Constraints: GIR : Genome independent reconstruction (de novo) – k-mer graph – Scripture(2010) Reports “all” transcripts – Cufflinks(2010), IsoLasso(2011), SLIDE(2012) Reports a minimal set of transcripts GGR : Genome-guided reconstruction (ab initio) – Exon identification – Genome-guided assembly – Trinity(2011), Velvet(2008), TransABySS(2008) Euler/de Brujin k-mer graph AGR : Annotation-guided reconstruction – Explicitly use existing annotation during assembly – RABT(2011) Simulate reads from annotated transcripts Alternative Splicing [Griffith and Marra 07] In this work, we propose a novel statistical “genome guided” method called “Transcriptome Reconstruction using Integer Programming” (TRIP) that incorporates fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. To reconstruct novel transcripts, we create a splice graph based an exact annotation of exon boundaries and RNA-Seq reads. The exact annotation of exons can be obtained from annotation databases (e.g., Ensembl) or can be inferred from aligned RNA- Seq reads. A splice graph is a directed acyclic graph (DAG), whose vertices represent exons and edges represent splicing events. We enumerate all maximal paths in the splice graph using a depth-first-search (DFS) algorithm. These paths correspond to putative transcripts and are the input for the TRIP algorithm. ABSTRACT INTRODUCTION Advances in Next Generation Sequencing Classification and Existing Approaches Splice Graph Preliminary Results (in silico) CONCLUSIONS AND FUTURE WORK TRIP Algorithm for novel transcript reconstruction – fragments length distribution Ongoing work – Comparison to other transcriptome reconstructions methods – Comparison on real datasets Solid pe reads Illumina 100x2 pe reads ACKNOWLEDGEMENTS NSF award IIS NSF award IIS Agriculture and Food Research Initiative Competitive Grant no from the USDA National Institute of Food and Agriculture Second Century Initiative Bioinformatics University Doctoral Fellowship, Georgia State University *Georgia State University, **University of Connecticut Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky* An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads t 4 : t 1 : t 2 : t 3 : Exon 2 and 6 are “distant” exons : how to phase them? 100x coverage, 2x100bp pe reads annotations for genes and exon boundaries GGR vs. GIR Challenges and Solutions Read length is currently much shorter then transcripts length Statistical reconstruction method - fragment length distribution Map the RNA-Seq reads to genome Construct Splice Graph - G(V,E) – V : exons – E: splicing events Candidate transcripts – depth-first-search (DFS) Filter candidate transcripts – fragment length distribution (FLD) – integer programming TRIP Transcriptome Reconstruction using Integer Programming Genome single reads exons pseudo-exons Garber, M. et al. Nat. Biotechnol. June Select the smallest set of putative transcripts that yields a good statistical fit between – empirically determined during library preparation – implied by “mapping” read pairs How to filter? ABCDE Make cDNA & shatter into fragments Sequence fragment ends Map reads Gene Expression ABC AC DE Transcriptome Reconstruction transcript/Transcript Expression RNA-Seq Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data. Transcriptome Reconstruction Roche/454 FLX Titanium million reads/run 400bp avg. length Illumina HiSeq 2000 Up to 6 billion PE reads/run bp read length SOLiD 4/ billion PE reads/run 35-50bp read length Ion Proton Sequencer t3t2t1 Mean : 500; Std. dev. 50 Naïve IP formulation 1 std. dev. Objective: Constraints: Variables: – T i (p) - set of candidate transcripts on which pe read p can be mapped with fragment length between i-1 and i std. dev., i=1,2,3,4 – y(t) -1 if a candidate transcript t is selected, 0 otherwise – x i (p) - 1 if the pe read p is selected to be mapped with fragment length between i-1 and i std. dev., i=1,2,3,4 – n( s i ) - expected portion of reads mapped with fragment length between i-1 and i std. dev., i=1,2,3,4 IP formulation 1,2,3,4 std. dev. Human genome UCSC annotations GNFAtlas2 gene expression levels – Uniform/geometric expression of gene transcripts Normally distributed fragment lengths – Mean 500, std. dev. 50 Simulation Setup Sensitivity: - portion of the annotated transcripts being captured by assembly method PPV - portion of annotated transcript among assembled transcripts Accuracy measures Flowchart for TRIP: (a) Positive Predictive Value (PPV) Flowchart for TRIP: (b) Sensitivity