Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.

Slides:



Advertisements
Similar presentations
Marius Nicolae Computer Science and Engineering Department
Advertisements

RNA-Seq based discovery and reconstruction of unannotated transcripts
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu Viral.
 Experimental Setup  Whole brain RNA-Seq Data from Sanger Institute Mouse Genomes Project [Keane et al. 2011]  Synthetic hybrids with different levels.
RNAseq.
Model-based species identification using DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Ion Măndoiu and Sotirios.
Transcriptome Sequencing with Reference
Peter Tsai Bioinformatics Institute, University of Auckland
University of Connecticut
Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)
Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work.
Data Analysis for High-Throughput Sequencing
Transcriptomics Jim Noonan GENE 760.
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
Bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing Jorge Duitama 1, Ion Mandoiu 1, and Pramod Srivastava.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Ion Mandoiu Computer Science and Engineering Department
1 Nicholas Mancuso Department of Computer Science Georgia State University Joint work with Bassam Tork, GSU Pavel Skums, CDC Ion M ӑ ndoiu, UConn Alex.
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Justin Kennedy, Ion Mandoiu, Bogdan Pasaniuc CSE Department, University of Connecticut.
Optimal Tag SNP Selection for Haplotype Reconstruction Jin Jun and Ion Mandoiu Computer Science & Engineering Department University of Connecticut.
Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data Jorge Duitama 1, Pramod Srivastava 2, and Ion.
High Throughput Sequencing
Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul.
CS 6293 Advanced Topics: Current Bioinformatics
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
Next generation sequencing Xusheng Wang 4/29/2010.
Li and Dewey BMC Bioinformatics 2011, 12:323
Vertex labels swapping Edges swapping Pathway activity levels with ratio Abstract Metabolic pathway activity estimation from RNA-Seq data Yvette Temate-Tiagueu,
Todd J. Treangen, Steven L. Salzberg
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
TopHat Mi-kyoung Seo. Today’s paper..TopHat Cole Trapnell at the University of Washington's Department of Genome Sciences Steven Salzberg Center.
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
Chromatin Immunoprecipitation DNA Sequencing (ChIP-seq)
Computational Methods for Analysis of Single Cell RNA-Seq Data
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Novel transcript reconstruction from ION Torrent sequencing reads and Viral Meta-genome Reconstruction from AmpliSeq Ion Torrent data University of Connecticut.
Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.
Sahar Al Seesi and Ion Măndoiu Computer Science and Engineering
UK NGS Sequencing Update July 2009 Dr Gerard Bishop - Division of Biology Dr Sarah Butcher – Centre for Bioinformatics.
Imputation-based local ancestry inference in admixed populations
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Adrian Caciula (GSU), Serghei Mangul (UCLA) James Lindsay, Ion.
A Maximum Likelihood Method for Quasispecies Reconstruction Nicholas Mancuso, Georgia State University Bassam Tork, Georgia State University Pavel Skums,
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
Manuel Holtgrewe Algorithmic Bioinformatics, Department of Mathematics and Computer Science PMSB Project: RNA-Seq Read Simulation.
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.
KGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data Alexander Artyomenko.
Canadian Bioinformatics Workshops
Multi-Genome Multi- read (MGMR) progress report Main source for Background Material, slide backgrounds: Eran Halperin's Accurate Estimation of Expression.
ICCABS 2013 kGEM: An EM-based Algorithm for Local Reconstruction of Viral Quasispecies Alexander Artyomenko.
Next-generation sequencing technology
RNA Quantitation from RNAseq Data
VCF format: variants c.f. S. Brown NYU
Next-generation sequencing technology
Alexander Zelikovsky Computer Science Department
Imputation-based local ancestry inference in admixed populations
Gene expression estimation from RNA-Seq data
Jin Zhang, Jiayin Wang and Yufeng Wu
2nd (Next) Generation Sequencing
Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs BMI/CS Spring 2019 Colin Dewey
Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi
Sequence Analysis - RNA-Seq 2
Presentation transcript:

Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky

Outline Introduction EM Algorithm Experimental results Conclusions and future work

Illumina Genome Analyzer bp reads Up to 38Gb/run Roche/454 FLX Titanium 400bp reads Up to 600Mb/run ABI SOLiD 50-75bp reads Up to 300Gb/run 2nd generation of sequencing technologies deliver several orders of magnitude higher throughput compared to classic Sanger sequencing Shorter reads! Helicos HeliScope 25-55bp reads Up to 35Gb/run Ultra-High Throughput Sequencing

Alternative Splicing [Griffith and Marra 07]

RNA-Seq ABCDE Make cDNA & shatter into fragments Sequence fragment ends Map reads Gene Expression (GE) ABC AC DE Isoform Discovery (ID) Isoform Expression (IE)

Gene Expression Challenges Read ambiguity (multireads) What is the gene length? ABCDE

Previous approaches to GE Ignore multireads [Mortazavi et al. 08] – Fractionally allocate multireads based on unique read estimates [Pasaniuc et al. 10] – EM algorithm for solving ambiguities Gene length: sum of lengths of exons that appear in at least one isoform  Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Read Ambiguity in IE ABCDE AC

Previous approaches to IE [Jiang&Wong 09] – Poisson model + importance sampling, single reads [Richard et al. 10] EM Algorithm based on Poisson model, single reads in exons [Li et al. 10] – EM Algorithm, single reads [Feng et al. 10] – Convex quadratic program, pairs used only for ID [Trapnell et al. 10] – Extends Jiang’s model to paired reads – Fragment length distribution

Our contribution EM Algorithm for IE – Single and/or paired reads – Fragment length distribution – Strand information – Base quality scores

Read-Isoform Compatibility

Fragment length distribution Paired reads ABC AC ABC ACAC ABC i j F a (i) F a (j)

Fragment length distribution Single reads ABC AC ABC AC ABC AC i j F a (i) F a (j)

IsoEM algorithm E-step M-step

Speed improvements Collapse identical reads into read classes i1 Isoforms i2i3i4i5i6 Reads (i1,i2)(i3,i4)(i3,i5)(i3,i4) LCA(i3,i4)

Speed improvements Run EM on connected components, in parallel i1 Isoforms i2 i3 i4 i5i6

Simulation setup Human genome UCSC known isoforms GNFAtlas2 gene expression levels – Uniform/geometric expression of gene isoforms Normally distributed fragment lengths – Mean 250, std. dev. 25

Accuracy measures Error Fraction (EF t ) – Percentage of isoforms (or genes) with relative error larger than given threshold t Median Percent Error (MPE) – Threshold t for which EF is 50% r 2

Error Fraction Curves - Isoforms 30M single reads of length 25

Error Fraction Curves - Genes 30M single reads of length 25

MPE and EF 15 by Gene Frequency 30M single reads of length 25

Validation on Human RNA-Seq Data ≈8 million 27bp reads from two cell lines [Sultan et al. 10] 47 AEEs measured by qPCR [Richard et al. 10]

Read Length Effect on IE MPE Fixed sequencing throughput (750Mb) Single Reads Paired Reads

Read Length Effect on IE r 2 Fixed sequencing throughput (750Mb)

Effect of Pairs & Strand Information 75bp reads

Runtime scalability Scalability experiments conducted on a Dell PowerEdge R900 – Four 6-core E7450Xeon processors at 2.4Ghz, 128Gb of internal memory

Conclusions & Future Work Presented EM algorithm for estimating isoform/gene expression levels – Integrates fragment length distribution, base qualities, pair and strand info – Java implementation available at Ongoing work – Comparison of RNA-Seq with DGE – Isoform discovery – Reconstruction & frequency estimation for virus quasispecies

Acknowledgments NSF awards & to IM and to AZ