Is this the end of RNA-Seq alignment?

Is this the end of RNA-Seq alignment?
Mick Watson
Edinburgh Genomics & The Roslin Institute, University of Edinburgh

Are microarrays dead?

Submissions to NCBI GEO by technology
GEO submissions will lag behind trends!
Not all RNA-Seq ends up in GEO; some goes to SRA
Microarray data from clinical trials will never be submitted publicly

Microarray design
What is the first step in microarray design? We find unique regions of the genes we want to put on the array.
Why do we do that? Because different genes often have high sequence homology to one another.
Why do we think we don’t need to do the same for RNA-Seq?

How you think RNA-Seq works
Take an RNA-Seq pair, align it to the genome, see that it overlaps an exon, add 1 to the counts table.
The reality is very different…

Consider a paired-end read
Read1: can align in 0, 1 or many locations (3 outcomes)
Read2: can align in 0, 1 or many locations (9 outcomes)
Read1 alignments can overlap 0, 1 or many genes (27 outcomes)
Read2 alignments can overlap 0, 1 or many genes (81 outcomes)
Those genes may be the same gene or different genes (162 outcomes)
The reads may be on the same strand or different strands (324 outcomes)
Some of those outcomes are mutually exclusive
In reality we end up with 193 possible outcomes
Only 49 outcomes represent the “one read, one gene” model
RNA-Seq software tools do not model all of those outcomes correctly!
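The running totals on this slide are cumulative products of the choices at each step. A minimal Python sketch of that arithmetic is below. It only reproduces the naive upper bound of 324; the exclusion rules that reduce it to 193 are not spelled out on the slide, so they are not modelled here.

```python
# Naive upper bound on paired-end outcomes: multiply the number of
# choices at each step listed on the slide.
steps = [
    ("read 1 aligns to 0, 1 or many locations", 3),
    ("read 2 aligns to 0, 1 or many locations", 3),
    ("read 1 alignments overlap 0, 1 or many genes", 3),
    ("read 2 alignments overlap 0, 1 or many genes", 3),
    ("same gene or different genes", 2),
    ("same strand or different strands", 2),
]

total = 1
running_totals = []
for label, choices in steps:
    total *= choices
    running_totals.append(total)
    print(f"{label}: {total} outcomes")
```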

How big a problem is this?
Used 50 bp single-end (50SE) RNA-Seq to analyse 5 different cell populations in a mouse lung cancer model
Choi H, Sheng J, Gao D, Li F, Durrans A, Ryu S, Lee SB, Narula N, Rafii S, Elemento O, Altorki NK, Wong STC, Mittal V: Transcriptome Analysis of Individual Stromal Cell Populations Identifies Stroma-Tumor Crosstalk in Mouse Lung Cancer Model. Cell Reports 2015, 10(7):1187-1201.

How big a problem is this?
We analysed the data using STAR to align the reads and htseq-count (with --union) to assign reads to genes (how you think RNA-Seq works)

Our work
Took core human GRCh38 chromosomes.
Extracted the longest single transcript for each protein-coding gene.
Removed short transcripts (< 400bp).
Simulated 1000 perfect 100 bp paired-end (100PE) reads from each transcript.
Quantified using 12 different pipelines.
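The selection and simulation steps above could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the function names, the input layout and the 300 bp fragment length are assumptions, and "1000 reads" is interpreted here as 1000 read pairs.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G and T."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_transcript_per_gene(transcripts, min_length=400):
    """Keep the longest transcript per gene, then drop those under min_length.

    transcripts: iterable of (gene_id, transcript_id, sequence) tuples
    (a hypothetical layout chosen for this sketch).
    """
    best = {}
    for gene_id, transcript_id, seq in transcripts:
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, seq)
    return {g: ts for g, ts in best.items() if len(ts[1]) >= min_length}


def simulate_perfect_pairs(seq, n_pairs=1000, read_len=100,
                           fragment_len=300, rng=None):
    """Sample error-free read pairs uniformly along the whole transcript.

    fragment_len is an assumed value; the slide does not state one.
    """
    rng = rng or random.Random(0)
    fragment_len = min(fragment_len, len(seq))
    pairs = []
    for _ in range(n_pairs):
        start = rng.randint(0, len(seq) - fragment_len)
        fragment = seq[start:start + fragment_len]
        read1 = fragment[:read_len]                        # forward mate
        read2 = reverse_complement(fragment[-read_len:])   # reverse mate
        pairs.append((read1, read2))
    return pairs
```

Because the reads are error-free and drawn from the full length of each transcript, any deviation from the simulated count in a pipeline's output reflects the pipeline, not the data.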

HTSeq based methods
Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015; 31(2):166-9.

HTSeq false negatives
Note: HTSeq immediately and without reservation throws out multi-mapped reads
This is a deliberate “feature” of the software
There are likely to be similar “problems” with other count-based methods
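A toy model shows why discarding multi-mapped reads produces false negatives. The sketch below is not htseq-count's code; it is a simplified counter written for illustration, and the gene names are hypothetical.

```python
from collections import Counter


def count_unique_only(read_alignments):
    """Count a read only when it aligns to exactly one gene.

    read_alignments: dict mapping read id -> list of gene ids the read
    aligns to. Reads with zero or several genes are discarded, which
    mimics the "throw out multi-mapped reads" policy in a simplified way.
    """
    counts = Counter()
    discarded = 0
    for genes in read_alignments.values():
        if len(genes) == 1:
            counts[genes[0]] += 1
        else:
            discarded += 1
    return counts, discarded


# Hypothetical example: two near-identical paralogs share every read.
toy_reads = {
    "r1": ["geneA"],
    "r2": ["geneA"],
    "r3": ["paralog1", "paralog2"],
    "r4": ["paralog1", "paralog2"],
    "r5": ["paralog1", "paralog2"],
}
counts, discarded = count_unique_only(toy_reads)
print(counts, discarded)
```

In this toy input both paralogs are expressed, yet both end up with a count of zero because every read that supports them is ambiguous.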

Cufflinks based methods
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010; 28(5):511-5.

Cufflinks, FPKM and “effective length”
FPKM: fragments per kilobase of transcript per million mapped fragments
Take the count of fragments overlapping a gene/transcript
Divide by the length of the transcript in kb (because longer genes will have more reads)
Divide by the size of the library in millions
Who decides the length of the gene? It’s not you!
Cufflinks does this by default, using “effective length”
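The three steps above translate directly into a short function. This is a sketch of the textbook FPKM definition using the annotated transcript length, not Cufflinks' implementation; the example lengths and the 20 million library size are illustrative values.

```python
def fpkm(fragment_count, transcript_length_bp, library_size_fragments):
    """Fragments per kilobase of transcript per million fragments."""
    length_kb = transcript_length_bp / 1_000
    library_millions = library_size_fragments / 1_000_000
    return fragment_count / length_kb / library_millions


# Same fragment count and library size, different transcript lengths.
short_fpkm = fpkm(1000, 500, 20_000_000)
long_fpkm = fpkm(1000, 5000, 20_000_000)
print(short_fpkm, long_fpkm)
```

With equal counts and an equal library size, the 500 bp transcript gets an FPKM of 100 and the 5000 bp transcript gets 10, so the only thing separating them is the length used in the denominator.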

Understanding the scatterplot…
We simulated the same number of reads for all genes
The library size is the same: 19.65M
Therefore, FPKM is determined only by the length of the gene
i.e. in the plot to the left, short genes have high FPKM
i.e. Cufflinks is over-estimating the FPKM of short genes (“it is known”)

Fixing the scatterplot
Note: we simulated reads from along the entire length of transcripts, i.e. there is no effective length. Actual length == effective length
We can turn effective length off in Cufflinks (--no-effective-length-correction)
So why is Cufflinks messing up the effective length of short transcripts?
Our theory: short transcripts/exons can be hard to map to (longer reads may exacerbate this!)
If exons aren’t mapped to, they will shorten the “effective length”
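To see why any effective-length correction hits short transcripts hardest, consider the common simplification effective length = length - mean fragment length + 1. The sketch below uses that approximation with an assumed mean fragment length of 200 bp; Cufflinks itself works from the full fragment length distribution, so this is an illustration of the general effect and not a reproduction of its calculation.

```python
def effective_length(transcript_length_bp, mean_fragment_bp):
    """Simplified effective length: number of positions a fragment can start."""
    return max(transcript_length_bp - mean_fragment_bp + 1, 1)


def inflation(transcript_length_bp, mean_fragment_bp=200):
    """Factor by which a length-normalised value grows when the effective
    length replaces the actual length in the denominator."""
    return transcript_length_bp / effective_length(
        transcript_length_bp, mean_fragment_bp
    )


for length in (400, 1000, 5000):
    print(length, effective_length(length, 200), round(inflation(length), 2))
```

Under these assumptions the value for a 400 bp transcript roughly doubles, while a 5000 bp transcript changes by only a few percent. Anything that shortens the effective length further, such as unmapped exons, would push short transcripts even higher.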

Sailfish
Sailfish builds a database of k-mers from known transcripts
No “mapping”: estimates expression directly from the reads using the k-mer index
Incredibly fast
Bias correction hasn’t worked
The over-estimated gene is GAGE2E; Sailfish estimates over 8000 reads for this gene
A member of the GAGE gene family implicated in a number of cancers
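The idea of a k-mer index, and why gene families with near-identical members are hard for it, can be shown with a toy example. This sketch is not Sailfish's algorithm (which resolves shared k-mers statistically); it only builds an index and separates unique from shared k-mer evidence. The sequences and names are made up.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def kmer_evidence(reads, index, k):
    """Tally read k-mers that point to one transcript versus several."""
    unique, shared = Counter(), Counter()
    for read in reads:
        for km in kmers(read, k):
            owners = index.get(km, ())
            if len(owners) == 1:
                unique[next(iter(owners))] += 1
            elif len(owners) > 1:
                for name in owners:
                    shared[name] += 1
    return unique, shared


# Hypothetical transcripts: two paralogs differing only at the last base.
toy_transcripts = {
    "paralog_1": "ACGTACGGTTCA",
    "paralog_2": "ACGTACGGTTCC",
    "unrelated": "TTTTGGGGCCCCAAAA",
}
index = build_index(toy_transcripts, k=5)
unique, shared = kmer_evidence(["ACGTACGGT"], index, k=5)
print(dict(unique), dict(shared))
```

Every k-mer of the example read occurs in both paralogs, so there is no unique evidence for either one. How a tool splits that shared evidence decides whether one family member is over- or under-estimated.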

Kallisto
Preprint came out after our work, so not included in the paper
Builds a De Bruijn graph from transcripts
No alignment
Super fast
About 50 genes it gets (badly) wrong

Bad genes
Data from all 12 methods for 19654 protein-coding genes is available
Use this to check your “favourite” genes and how accurate the methods are!
Of 19654 genes, 958 were assigned counts < 100 or > 1900 by at least one method
Errors dominated by HTSeq
Both Cufflinks and Sailfish over- and under-estimate many genes too
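Checking a gene list against the thresholds quoted above is a simple filter. The sketch below assumes the per-method counts have been loaded into nested dictionaries; the method names, gene names and numbers in the example are invented for illustration and are not taken from the published data.

```python
def flag_bad_genes(counts_by_method, low=100, high=1900):
    """Return genes whose count falls outside [low, high] in any method.

    counts_by_method: dict mapping method name -> dict of gene -> count.
    The result maps each flagged gene to the list of methods that flagged it.
    """
    flagged = {}
    for method, counts in counts_by_method.items():
        for gene, value in counts.items():
            if value < low or value > high:
                flagged.setdefault(gene, []).append(method)
    return flagged


# Made-up numbers; 1000 would be the simulated truth for every gene.
toy_counts = {
    "method_a": {"gene1": 998, "gene2": 0, "gene3": 1010},
    "method_b": {"gene1": 1003, "gene2": 640, "gene3": 2500},
}
flagged = flag_bad_genes(toy_counts)
print(flagged)
```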

Our solution? MMGs
We believe there are some genes that cannot be accurately quantified by RNA-Seq
Multi-map groups: defined as groups of genes that reads consistently multi-map to
Data led rather than annotation led; however, we find that the data leads back to the annotation
We propose to analyse these genes as a “group”: look for differential expression at the level of the MMG
If we find differential expression, use a different tool (e.g. qPCR) to figure out which member is responsible
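One way to picture multi-map groups is as connected components in a graph where genes are linked whenever reads multi-map between them. The sketch below takes that view and is not the published method: the function name is hypothetical, and "consistently" is approximated by a `min_shared_reads` threshold, which is an assumption.

```python
from collections import Counter
from itertools import combinations


def multi_map_groups(read_alignments, min_shared_reads=1):
    """Group genes that are linked by multi-mapped reads.

    read_alignments: iterable of gene-id lists, one list per read.
    Two genes are linked when at least min_shared_reads reads map to both;
    groups are the connected components of the resulting graph.
    """
    pair_support = Counter()
    for genes in read_alignments:
        for a, b in combinations(sorted(set(genes)), 2):
            pair_support[(a, b)] += 1

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in pair_support.items():
        if n >= min_shared_reads:
            parent[find(a)] = find(b)

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(g) for g in groups.values())


toy_reads = [["g1", "g2"], ["g2", "g3"], ["g4"], ["g5", "g6"]]
groups = multi_map_groups(toy_reads)
print(groups)
```

Read counts can then be summed per group and tested for differential expression at the group level, leaving the question of which member is responsible to a follow-up assay.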

We do find a signature in the MMGs
Robert C and Watson M (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease, Genome Biology, accepted

Is this the end of RNA-Seq alignment?
No, but only because we can align to define gene structure
Alignment-free methods are fast and accurate: Sailfish, Kallisto, Salmon
All of the above are similar to microarrays in concept: they rely on a kind of “in silico hybridisation”
We don’t know how robust they are to poor annotation
Counting reads at the level of the MMG can reveal novel insights

Robert C, Watson M. (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol. 16:177

Follow me: Twitter: @BioMickWatson Blog: biomickwatson.wordpress.com

Acknowledgements
Funders: BBSRC, Roslin Foundation, TSB
People: Edinburgh Genomics, Roslin, Christelle Robert, Shriram Bhosle, Alan Archibald, David Hume
Edinburgh Genomics: http://genomics.ed.ac.uk
The Roslin Institute: http://www.roslin.ed.ac.uk