Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,

Slides:



Advertisements
Similar presentations
Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone.
Advertisements

Marius Nicolae Computer Science and Engineering Department
RNA-Seq based discovery and reconstruction of unannotated transcripts
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu Viral.
JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* Scaffolding Large Genomes Using Integer Linear Programming University of Connecticut*Georgia.
 Experimental Setup  Whole brain RNA-Seq Data from Sanger Institute Mouse Genomes Project [Keane et al. 2011]  Synthetic hybrids with different levels.
RNAseq.
12/04/2017 RNA seq (I) Edouard Severing.
Fast and accurate short read alignment with Burrows–Wheeler transform
Transcriptome Sequencing with Reference
Peter Tsai Bioinformatics Institute, University of Auckland
Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)
Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work.
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
mRNA-Seq: methods and applications
Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul.
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
LECTURE 2 Splicing graphs / Annoteted transcript expression estimation.
Li and Dewey BMC Bioinformatics 2011, 12:323
Expression Analysis of RNA-seq Data
Todd J. Treangen, Steven L. Salzberg
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
Next Generation DNA Sequencing
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
Computational Methods for Analysis of Single Cell RNA-Seq Data
The iPlant Collaborative
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Novel transcript reconstruction from ION Torrent sequencing reads and Viral Meta-genome Reconstruction from AmpliSeq Ion Torrent data University of Connecticut.
Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.
Sahar Al Seesi and Ion Măndoiu Computer Science and Engineering
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
University of Connecticut School of Engineering Assembler Reference Abyss Simpson et al., J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones,
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Adrian Caciula (GSU), Serghei Mangul (UCLA) James Lindsay, Ion.
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
Canadian Bioinformatics Workshops
KGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data Alexander Artyomenko.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
ICCABS 2013 kGEM: An EM-based Algorithm for Local Reconstruction of Viral Quasispecies Alexander Artyomenko.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Kallisto: near-optimal RNA seq quantification tool
Alexander Zelikovsky Computer Science Department
Transcriptome Assembly
Jin Zhang, Jiayin Wang and Yufeng Wu
Reference based assembly
Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs BMI/CS Spring 2019 Colin Dewey
Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi
Quantitative analyses using RNA-seq data
Sequence Analysis - RNA-Seq 2
Schematic representation of a transcriptomic evaluation approach.
Presentation transcript:

Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected, 0 otherwise – x(p) - 1 if the pe read p is selected to be mapped within 1 std. dev. – n( s 1 ) - expected portion of reads mapped within 1 std. dev. Objective: Constraints: GIR : Genome independent reconstruction (de novo) – k-mer graph – Scripture(2010) Reports “all” transcripts – Cufflinks(2010), IsoLasso(2011), SLIDE(2012) Reports a minimal set of transcripts GGR : Genome-guided reconstruction (ab initio) – Exon identification – Genome-guided assembly – Trinity(2011), Velvet(2008), TransABySS(2008) Euler/de Brujin k-mer graph AGR : Annotation-guided reconstruction – Explicitly use existing annotation during assembly – RABT(2011) Simulate reads from annotated transcripts Alternative Splicing [Griffith and Marra 07] In this work, we propose a novel statistical “genome guided” method called “Transcriptome Reconstruction using Integer Programming” (TRIP) that incorporates fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. To reconstruct novel transcripts, we create a splice graph based an exact annotation of exon boundaries and RNA-Seq reads. The exact annotation of exons can be obtained from annotation databases (e.g., Ensembl) or can be inferred from aligned RNA- Seq reads. A splice graph is a directed acyclic graph (DAG), whose vertices represent exons and edges represent splicing events. We enumerate all maximal paths in the splice graph using a depth-first-search (DFS) algorithm. These paths correspond to putative transcripts and are the input for the TRIP algorithm. ABSTRACT INTRODUCTION Advances in Next Generation Sequencing Classification and Existing Approaches Splice Graph Preliminary Results (in silico) CONCLUSIONS AND FUTURE WORK  TRIP Algorithm for novel transcript reconstruction – fragments length distribution  Ongoing work – Comparison to other transcriptome reconstructions methods – Comparison on real datasets Solid pe reads Illumina 100x2 pe reads ACKNOWLEDGEMENTS NSF award IIS NSF award IIS Agriculture and Food Research Initiative Competitive Grant no from the USDA National Institute of Food and Agriculture Second Century Initiative Bioinformatics University Doctoral Fellowship, Georgia State University *Georgia State University, **University of Connecticut Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky* An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads t 4 : t 1 : t 2 : t 3 : Exon 2 and 6 are “distant” exons : how to phase them? 100x coverage, 2x100bp pe reads annotations for genes and exon boundaries GGR vs. GIR Challenges and Solutions  Read length is currently much shorter then transcripts length  Statistical reconstruction method - fragment length distribution  Map the RNA-Seq reads to genome  Construct Splice Graph - G(V,E) – V : exons – E: splicing events  Candidate transcripts – depth-first-search (DFS)  Filter candidate transcripts – fragment length distribution (FLD) – integer programming TRIP Transcriptome Reconstruction using Integer Programming Genome single reads exons pseudo-exons Garber, M. et al. Nat. Biotechnol. June Select the smallest set of putative transcripts that yields a good statistical fit between – empirically determined during library preparation – implied by “mapping” read pairs How to filter? ABCDE Make cDNA & shatter into fragments Sequence fragment ends Map reads Gene Expression ABC AC DE Transcriptome Reconstruction transcript/Transcript Expression RNA-Seq Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data. Transcriptome Reconstruction Roche/454 FLX Titanium million reads/run 400bp avg. length Illumina HiSeq 2000 Up to 6 billion PE reads/run bp read length SOLiD 4/ billion PE reads/run 35-50bp read length Ion Proton Sequencer t3t2t1 Mean : 500; Std. dev. 50 Naïve IP formulation 1 std. dev. Objective: Constraints: Variables: – T i (p) - set of candidate transcripts on which pe read p can be mapped with fragment length between i-1 and i std. dev., i=1,2,3,4 – y(t) -1 if a candidate transcript t is selected, 0 otherwise – x i (p) - 1 if the pe read p is selected to be mapped with fragment length between i-1 and i std. dev., i=1,2,3,4 – n( s i ) - expected portion of reads mapped with fragment length between i-1 and i std. dev., i=1,2,3,4 IP formulation 1,2,3,4 std. dev. Human genome UCSC annotations  GNFAtlas2 gene expression levels – Uniform/geometric expression of gene transcripts  Normally distributed fragment lengths – Mean 500, std. dev. 50 Simulation Setup Sensitivity: - portion of the annotated transcripts being captured by assembly method PPV - portion of annotated transcript among assembled transcripts Accuracy measures Flowchart for TRIP: (a) Positive Predictive Value (PPV) Flowchart for TRIP: (b) Sensitivity