Is this the end of RNA-Seq alignment? Mick Watson Edinburgh Genomics & The Roslin Institute University of Edinburgh
Are microarrays dead?
Submissions to NCBI GEO by technology GEO submissions will lag behind trends! Not all RNA-Seq ends up in GEO; some goes to SRA Microarrays used in clinical trials will never be submitted publicly
Microarray design What is the first step in microarray design? We find unique regions of the genes we want to put on the array Why do we do that? Because different genes often have high sequence homology to one another Why do we think we don’t need to do the same for RNA-Seq?
How you think RNA-Seq works RNA-Seq pair → Align to genome; overlaps an exon → Add 1 to counts table The reality is very different…
Consider a paired-end read Read1: can align in 0, 1 or many locations (3 outcomes) Read2: can align in 0, 1 or many locations (9 outcomes) Read1 alignments can overlap 0, 1 or many genes (27 outcomes) Read2 alignments can overlap 0, 1 or many genes (81 outcomes) Those genes may be the same gene or different genes (162 outcomes) The reads may be on the same strand or different strands (324 outcomes) Some of those outcomes are mutually exclusive In reality we end up with 193 possible outcomes Only 49 outcomes represent “one read, one gene” model RNA-Seq software tools do not model all of those outcomes correctly!!
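The running totals on this slide are plain products of the options at each step. A minimal sketch that reproduces them follows; it only covers the upper bound of 324 and does not attempt the pruning of mutually exclusive outcomes that leads to 193 (the labels are paraphrased from the slide).

```python
# Options contributed by each question on the slide, in order.
choices = [
    ("read1 aligns to 0, 1 or many locations", 3),
    ("read2 aligns to 0, 1 or many locations", 3),
    ("read1 alignments overlap 0, 1 or many genes", 3),
    ("read2 alignments overlap 0, 1 or many genes", 3),
    ("same gene or different genes", 2),
    ("same strand or different strands", 2),
]


def running_totals(options):
    """Return the cumulative product after each choice is added."""
    totals = []
    total = 1
    for _, n_options in options:
        total *= n_options
        totals.append(total)
    return totals


totals = running_totals(choices)
print(totals)  # [3, 9, 27, 81, 162, 324]
```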
How big a problem is this? Used 50SE RNA-Seq to analyse 5 different cell populations in a mouse lung cancer model Choi H, Sheng J, Gao D, Li F, Durrans A, Ryu S, Lee Sharrell B, Narula N, Rafii S, Elemento O, Altorki Nasser K, Wong Stephen TC, Mittal V: Transcriptome Analysis of Individual Stromal Cell Populations Identifies Stroma-Tumor Crosstalk in Mouse Lung Cancer Model. Cell Reports 2015, 10(7):1187-1201.
How big a problem is this? We analysed the data using STAR to align the reads and htseq-count (with --mode=union) to assign reads to genes (how you think RNA-Seq works)
Our work Took core human GRCh38 chromosomes. Extracted longest single transcript for each protein-coding gene. Removed short transcripts (< 400bp). Simulated 1000 perfect 100PE reads from each transcript, quantified using 12 different pipelines.
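The transcript selection described here can be sketched in a few lines. This is an illustration, not the script used in the study: the input format (tuples of gene, transcript ID and length) and the function name are assumptions, and ties in length are broken by keeping the first transcript seen.

```python
def longest_transcripts(records, min_length=400):
    """Keep the longest transcript per gene, then drop those under min_length bp.

    `records` is an iterable of (gene_id, transcript_id, length_bp) tuples
    (a hypothetical input format chosen for this sketch).
    """
    best = {}
    for gene, transcript, length in records:
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript, length)
    return {g: t for g, t in best.items() if t[1] >= min_length}


# Made-up identifiers and lengths, for illustration only.
records = [
    ("gene1", "tx1a", 1200),
    ("gene1", "tx1b", 2500),
    ("gene2", "tx2a", 350),
    ("gene3", "tx3a", 400),
]
selected = longest_transcripts(records)
print(selected)
```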
HTSeq based methods Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015 31(2):166-9.
HTSeq false negatives Note: HTSeq immediately and without reservation throws out multi-mapped reads This is a deliberate “feature” of the software There are likely to be similar “problems” with other count-based methods
Cufflinks based methods Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):511-5.
Cufflinks, FPKM and “effective length” FPKM: fragments per kilobase per million Take the count of reads overlapping a gene/transcript Divide by the length of the transcript in Kb (because longer genes will have more reads) Divide by the size of the library in millions Who decides the length of the gene? It’s not you! Cufflinks does this by default, using “effective length”
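The three steps above fit in one small function. This is a sketch of the textbook definition only; the function and parameter names are illustrative, and it does not reproduce how Cufflinks derives an effective length. Whichever length is passed in (actual or effective) is where that choice enters the result.

```python
def fpkm(fragments, length_bp, library_size):
    """Fragments per kilobase of transcript per million fragments in the library."""
    if length_bp <= 0 or library_size <= 0:
        raise ValueError("length and library size must be positive")
    length_kb = length_bp / 1_000
    library_millions = library_size / 1_000_000
    return fragments / length_kb / library_millions


# 500 fragments on a 2 kb transcript in a 10M-fragment library.
value = fpkm(500, 2_000, 10_000_000)
print(value)  # 25.0
```

Halving `length_bp` doubles the FPKM for the same count, which is why the choice of length matters so much for short genes.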
Understanding the scatterplot…. We simulated the same number of reads for all genes The library size is the same – 19.65M Therefore, FPKM is only defined by the length of the gene i.e. in the plot to the left, short genes have high FPKM i.e. Cufflinks is over-estimating the FPKM of short genes (“it is known”)
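As a worked version of this argument: write N for the fragments counted for a gene, L for its length in bp and M for the library size. Taking N = 1000 is an assumption here (it treats each simulated read pair as one fragment and supposes every fragment is assigned correctly); M = 19.65M is the figure from the slide.

```latex
\mathrm{FPKM} = \frac{N}{(L/10^{3})\,(M/10^{6})} = \frac{10^{9}\,N}{L\,M}
\qquad\Longrightarrow\qquad
\mathrm{FPKM} \approx \frac{10^{12}}{1.965 \times 10^{7}\,L} \approx \frac{5.09 \times 10^{4}}{L}
```

Under those assumptions the expected FPKM is about 101.8 for a 500 bp gene, 50.9 for 1 kb and 10.2 for 5 kb, so any extra shortening of L has the largest effect on the shortest genes.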
Fixing the scatterplot Note: we simulated reads from along the entire length of transcripts i.e. there is no effective length. Actual length == effective length We can turn effective length off in Cufflinks (--no-effective-length-correction) So why is Cufflinks messing up the effective length of short transcripts? Our theory: Short transcripts/exons can be hard to map to (longer reads may exacerbate this!) If exons aren’t mapped to, they will shorten the “effective length”
Sailfish Sailfish builds database of kmers from known transcripts No “mapping” – estimates expression directly from the reads using the kmer index Incredibly fast Bias correction hasn’t worked Over-estimated gene is GAGE2E Sailfish estimates over 8000 reads for this gene A member of the GAGE gene family implicated in a number of cancers
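To see why members of a gene family are hard for any k-mer based approach, it helps to ask how many k-mers are unique to each transcript. The sketch below is a toy illustration, not Sailfish's algorithm; the sequences, the names and the tiny k are all made up for the example.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def unique_kmer_fraction(transcripts, k):
    """Fraction of each transcript's distinct k-mers found in no other transcript."""
    index = build_kmer_index(transcripts, k)
    fractions = {}
    for name, seq in transcripts.items():
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        unique = sum(1 for kmer in kmers if index[kmer] == {name})
        fractions[name] = unique / len(kmers) if kmers else 0.0
    return fractions


# Toy sequences: tx1 and tx2 differ only at the last base, tx3 is unrelated.
toy = {
    "tx1": "ACGTACGGTCA",
    "tx2": "ACGTACGGTCT",
    "tx3": "TTGACCAGGAT",
}
fractions = unique_kmer_fraction(toy, k=4)
print(fractions)
```

Here only one k-mer in eight distinguishes tx1 from tx2, so almost all of the evidence for either transcript is shared with the other.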
Kallisto Preprint came out after our work so not included in paper Builds De Bruijn graph from transcripts No alignment Super fast Gets about 50 genes (badly) wrong
Bad genes Data from all 12 methods for 19654 protein coding genes is available Use this to check your “favourite” genes and how accurate the methods are! Of 19654 genes, 958 were assigned counts < 100 or greater than 1900 by at least one method Errors dominated by HTSeq Both Cufflinks and Sailfish over- and under- estimate many genes too
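Checking your own favourite genes against these thresholds is a simple filter over a per-gene, per-method table. A minimal sketch, assuming the table has been loaded into a nested dict (the layout, gene names and numbers below are invented for illustration and are not results from the study):

```python
LOW, HIGH = 100, 1900  # thresholds quoted on the slide; 1000 reads were simulated per gene


def flag_bad_genes(counts_by_gene, low=LOW, high=HIGH):
    """Return {gene: [methods]} for genes where any method falls outside [low, high]."""
    flagged = {}
    for gene, per_method in counts_by_gene.items():
        offenders = sorted(m for m, c in per_method.items() if c < low or c > high)
        if offenders:
            flagged[gene] = offenders
    return flagged


# Made-up numbers for illustration only.
example = {
    "geneA": {"htseq": 1000, "cufflinks": 1003.2, "sailfish": 998.1},
    "geneB": {"htseq": 0, "cufflinks": 950.0, "sailfish": 1010.0},
    "geneC": {"htseq": 1000, "cufflinks": 990.0, "sailfish": 8120.5},
}
flagged = flag_bad_genes(example)
print(flagged)
```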
Our solution? MMGs We believe there are some genes that cannot be accurately quantified by RNA-Seq Multi-map groups: defined as groups of genes that reads consistently multi-map to Data led rather than annotation led – however, find that data leads back to annotation We propose to analyse these genes as a “group” – look for differential expression at the level of the MMG If find differential expression, use a different tool (e.g. qPCR) to figure out which member is responsible
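One way to picture MMG construction is as connected components in a graph where genes are linked by the reads that multi-map between them. The sketch below is not the implementation from the paper: the input format, the function name and the `min_reads` threshold (a stand-in for "consistently") are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def multi_map_groups(read_to_genes, min_reads=2):
    """Group genes linked by at least `min_reads` shared multi-mapped reads."""
    # Count how many reads link each pair of genes.
    pair_support = Counter()
    for genes in read_to_genes.values():
        for a, b in combinations(sorted(set(genes)), 2):
            pair_support[(a, b)] += 1

    # Union-find over genes connected by well-supported pairs.
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for (a, b), n_reads in pair_support.items():
        if n_reads >= min_reads:
            parent[find(a)] = find(b)

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical read-to-gene assignments; gene names are placeholders.
reads = {
    "r1": ["A", "B"],
    "r2": ["B", "A"],
    "r3": ["B", "C"],
    "r4": ["C", "B"],
    "r5": ["D", "E"],  # a single linking read, below the threshold
    "r6": ["F"],       # uniquely mapped
}
groups = multi_map_groups(reads)
print(groups)  # [['A', 'B', 'C']]
```

Counts would then be summed per group rather than per gene, and differential expression tested at the group level.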
We do find a signature in the MMGs Robert C and Watson M (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease, Genome Biology, accepted
Is this the end of RNA-Seq alignment? No: but only because we can align to define gene structure Alignment-free methods are fast and accurate Sailfish Kallisto Salmon All of the above similar to microarrays in concept Rely on a kind of “in silico hybridisation” We don’t know how robust they are to poor annotation Counting reads at the level of the MMG can reveal novel insights
Robert C, Watson M. (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol. 16:177
Follow me: Twitter: @BioMickWatson Blog: biomickwatson.wordpress.com
Acknowledgements Funders: BBSRC, Roslin Foundation, TSB People: Edinburgh Genomics, Roslin, Christelle Robert, Shriram Bhosle, Alan Archibald, David Hume Edinburgh Genomics: http://genomics.ed.ac.uk The Roslin Institute: http://www.roslin.ed.ac.uk