University of Edinburgh

Beware the immeasurable: the genes RNA-Seq cannot accurately quantify and their role in human disease (plus a few unrelated slides about MinION)
Mick Watson
Edinburgh Genomics & The Roslin Institute, University of Edinburgh

Are microarrays dead?

Submissions to NCBI GEO by technology
GEO submissions will lag behind trends!
Not all RNA-Seq ends up in GEO; some goes to SRA
Microarrays used in clinical trials will never be submitted publicly

Microarray design
What is the first step in microarray design?
We find unique regions of the genes we want to put on the array
Why do we do that?
Because different genes often have high sequence homology to one another
Why do we think we don’t need to do the same for RNA-Seq?

How you think RNA-Seq works
RNA-Seq pair -> Align to genome; overlaps an exon -> Add 1 to counts table
The reality is very different….

Consider a paired-end read
Read1: can align in 0, 1 or many locations (3 outcomes)
Read2: can align in 0, 1 or many locations (9 outcomes)
Read1 alignments can overlap 0, 1 or many genes (27 outcomes)
Read2 alignments can overlap 0, 1 or many genes (81 outcomes)
Those genes may be the same gene or different genes (162 outcomes)
The reads may be on the same strand or different strands (324 outcomes)
Some of those outcomes are mutually exclusive
In reality we end up with 193 possible outcomes
Only 49 outcomes represent the “one read, one gene” model
RNA-Seq software tools do not model all of those outcomes correctly!!
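The running totals above are just a product of independent choices, which can be checked with a short enumeration. This is a minimal sketch: the category names are illustrative, and the single consistency rule shown is only one example of a mutually exclusive combination. It does not attempt to reproduce the 193 or 49 figures, which depend on the full set of exclusions used in the original analysis.

```python
from itertools import product

# Each tuple is one of the independent choices listed on the slide.
ALIGNMENTS = ("none", "one", "many")      # where a read aligns
GENE_OVERLAPS = ("none", "one", "many")   # genes overlapped by those alignments
SAME_GENE = (True, False)                 # do both mates hit the same gene?
SAME_STRAND = (True, False)               # are both mates on the same strand?

naive_outcomes = list(product(
    ALIGNMENTS, ALIGNMENTS,        # read 1, read 2 alignment counts
    GENE_OVERLAPS, GENE_OVERLAPS,  # read 1, read 2 gene overlaps
    SAME_GENE, SAME_STRAND,
))
print(len(naive_outcomes))  # 3 * 3 * 3 * 3 * 2 * 2 = 324


def is_consistent(outcome):
    """Illustrative rule only: a read that aligns nowhere cannot overlap a gene."""
    r1_aln, r2_aln, r1_genes, r2_genes, _same_gene, _same_strand = outcome
    if r1_aln == "none" and r1_genes != "none":
        return False
    if r2_aln == "none" and r2_genes != "none":
        return False
    return True


consistent = [o for o in naive_outcomes if is_consistent(o)]
```

Even this one rule removes a large block of combinations, which is why the naive 324 shrinks once impossible outcomes are excluded.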

How big a problem is this?
Used 50SE RNA-Seq to analyse 5 different cell populations in a mouse lung cancer model
Choi H, Sheng J, Gao D, Li F, Durrans A, Ryu S, Lee Sharrell B, Narula N, Rafii S, Elemento O, Altorki Nasser K, Wong Stephen TC, Mittal V: Transcriptome Analysis of Individual Stromal Cell Populations Identifies Stroma-Tumor Crosstalk in Mouse Lung Cancer Model. Cell Reports 2015, 10(7):

How big a problem is this?
We analysed the data using STAR to align the reads and htseq-count (with --mode union) to assign reads to genes (how you think RNA-Seq works)
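The "one read, one gene" counting model can be sketched in a few lines. This is a simplified illustration of union-style overlap resolution, not htseq-count's actual code: the function name, the gene layout and the half-open coordinates are all assumptions made for the example.

```python
def assign_read_union(read_start, read_end, genes):
    """Return the label a union-style counter would give one aligned read.

    genes: dict mapping gene name -> list of (start, end) exon intervals,
    half-open coordinates on one chromosome and strand (hypothetical layout).
    """
    hits = {
        name
        for name, exons in genes.items()
        if any(read_start < e_end and e_start < read_end
               for e_start, e_end in exons)
    }
    if not hits:
        return "__no_feature"   # read overlaps no annotated exon
    if len(hits) > 1:
        return "__ambiguous"    # read overlaps more than one gene
    return next(iter(hits))     # the simple case: exactly one gene


# Toy annotation: geneA and geneB overlap between positions 380 and 400.
toy_genes = {"geneA": [(100, 200), (300, 400)], "geneB": [(380, 500)]}
print(assign_read_union(120, 170, toy_genes))  # geneA
print(assign_read_union(385, 395, toy_genes))  # __ambiguous
print(assign_read_union(210, 260, toy_genes))  # __no_feature
```

Only the first of the three toy reads adds 1 to a gene's count; the other two are set aside.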

Our work
Took core human GRCh38 chromosomes.
Extracted longest single transcript for each protein-coding gene.
Removed short transcripts (< 400bp).
Simulated 1000 perfect 100PE reads from each transcript, quantified using 12 different pipelines.
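A rough idea of what "perfect" paired-end simulation involves is sketched below. The fragment size range, the uniform sampling and the function names are assumptions for illustration; the slides do not say which simulator or settings were used.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pairs(transcript, n_pairs=1000, read_len=100,
                   min_frag=200, max_frag=400, seed=0):
    """Draw error-free read pairs uniformly along one transcript.

    min_frag/max_frag are hypothetical; they are not taken from the slides.
    """
    if len(transcript) < min_frag:
        raise ValueError("transcript shorter than the minimum fragment size")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        frag_len = rng.randint(min_frag, min(max_frag, len(transcript)))
        start = rng.randint(0, len(transcript) - frag_len)
        fragment = transcript[start:start + frag_len]
        read1 = fragment[:read_len]                        # forward mate
        read2 = reverse_complement(fragment[-read_len:])   # reverse mate
        pairs.append((read1, read2))
    return pairs
```

Because every transcript receives the same number of pairs, any variation in the estimated expression between genes comes from the quantification pipeline, not from the input.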

Correlation of expected vs observed FPKM

HTSeq-based methods
Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics (2):166-9.

Should you use HTSeq? Not if you care about gene families
Note: HTSeq immediately and without reservation throws out multi-mapped reads
This is a deliberate “feature” of the software
There are likely to be similar problems with other count-based methods
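The effect on gene families is easy to see in a toy example. The sketch below is a generic "unique reads only" counter, not HTSeq itself, and the read and gene names are made up.

```python
from collections import Counter


def count_unique_only(alignments):
    """alignments: list of (read_id, gene) pairs, one per reported alignment."""
    hits_per_read = Counter(read_id for read_id, _ in alignments)
    counts = Counter()
    for read_id, gene in alignments:
        if hits_per_read[read_id] == 1:   # keep uniquely aligned reads only
            counts[gene] += 1
    return counts


# Hypothetical data: paralogA and paralogB are identical in sequence,
# so every read from either one aligns equally well to both.
toy_alignments = [
    ("r1", "paralogA"), ("r1", "paralogB"),
    ("r2", "paralogA"), ("r2", "paralogB"),
    ("r3", "geneC"),
]
toy_counts = count_unique_only(toy_alignments)
print(toy_counts["paralogA"], toy_counts["paralogB"], toy_counts["geneC"])  # 0 0 1
```

Both paralogs are expressed in this toy data, yet both end up with a count of zero.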

Cufflinks based methods
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):511-5.

Cufflinks, FPKM and “effective length”
FPKM: fragments per kilobase per million
Take the count of reads overlapping a gene/transcript
Divide by the length of the transcript in Kb (because longer genes will have more reads)
Divide by the size of the library in millions
Who decides the length of the gene? It’s not you!
Cufflinks does this by default, using “effective length”
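The three steps above translate directly into code. This is a minimal sketch of the basic definition with made-up numbers; the function name and arguments are illustrative, and no effective-length adjustment is applied.

```python
def fpkm(fragment_count, transcript_length_bp, library_size_fragments):
    """Fragments per kilobase of transcript per million fragments in the library."""
    length_kb = transcript_length_bp / 1_000
    library_millions = library_size_fragments / 1_000_000
    return fragment_count / length_kb / library_millions


# Hypothetical numbers: same fragment count and library size, different lengths.
print(fpkm(1000, 2000, 20_000_000))  # 25.0
print(fpkm(1000, 500, 20_000_000))   # 100.0
```

With the count and library size held fixed, the value depends only on whichever length is plugged in, so the choice of length matters a great deal.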

Understanding the scatterplot….
We simulated the same number of reads for all genes
The library size is the same – 19.65M
Therefore, FPKM is only defined by the length of the gene
i.e. in the plot to the left, short genes have high FPKM
i.e. Cufflinks is over-estimating the FPKM of short genes (“it is known”)

Fixing the scatterplot
Note: we simulated reads from along the entire length of transcripts, i.e. there is no effective length. Actual length == effective length
We can turn effective length off in Cufflinks (--no-effective-length-correction)
So why is Cufflinks messing up the effective length of short transcripts?
Our theory:
Short transcripts/exons can be hard to map to (longer reads may exacerbate this!)
If exons aren’t mapped to, they will shorten the “effective length”
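To see why short transcripts are the sensitive case, consider one commonly used form of effective length: the expected number of positions at which a fragment could start, averaged over the fragment length distribution. The sketch below uses that form with a made-up two-value distribution; whether it matches what Cufflinks computes in every detail is an assumption.

```python
def effective_length(transcript_length, frag_len_probs):
    """Expected number of start positions for a fragment on the transcript.

    frag_len_probs: dict mapping fragment length -> probability (hypothetical
    distribution). Fragments longer than the transcript contribute nothing.
    """
    return sum(
        prob * (transcript_length - frag_len + 1)
        for frag_len, prob in frag_len_probs.items()
        if frag_len <= transcript_length
    )


toy_frag_dist = {200: 0.5, 300: 0.5}   # made-up fragment length distribution

for length in (2000, 400):
    eff = effective_length(length, toy_frag_dist)
    # Ratio by which FPKM rises when eff replaces the actual length.
    print(length, eff, round(length / eff, 2))
```

In this toy case the 2000 bp transcript's FPKM rises by about 14% when effective length is used, while the 400 bp transcript's rises more than 2.6-fold. Any further shortening of the effective length therefore hits short transcripts hardest.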

Sailfish
Sailfish builds a database of k-mers from known transcripts
No “mapping” – estimates expression directly from the reads using the k-mer index
Incredibly fast
Bias correction hasn’t worked
Over-estimated gene is GAGE2E
Sailfish estimates over 8000 reads for this gene
A member of the GAGE gene family implicated in a number of cancers
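A toy k-mer index shows why gene families are hard for alignment-free methods too. This is a sketch of the general idea only, not Sailfish's data structure or inference; the sequences, names and k are invented.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def ambiguous_fraction(index, name):
    """Fraction of a transcript's distinct k-mers shared with another transcript."""
    own = [owners for owners in index.values() if name in owners]
    shared = sum(1 for owners in own if len(owners) > 1)
    return shared / len(own)


# Made-up sequences: famA and famB differ only at the final base.
toy_transcripts = {
    "famA": "ACGTACGGTCA",
    "famB": "ACGTACGGTCT",
    "other": "TTTTGGGGCCCC",
}
toy_index = build_kmer_index(toy_transcripts, k=5)
```

Here 6 of the 7 distinct 5-mers in famA also occur in famB, so almost all the evidence for one family member is equally good evidence for the other, and the estimator has to decide how to split it.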

Kallisto
Preprint came out after our work so not included in paper
Builds De Bruijn graph from transcripts
No alignment
Super fast
About 50 genes it gets (badly) wrong

Bad genes
Data from all 12 methods for protein coding genes is available
Use this to check your “favourite” genes and how accurate the methods are!
Of genes, 958 were assigned counts < 100 or greater than 1900 by at least one method
Errors dominated by HTSeq
Both Cufflinks and Sailfish over- and under-estimate many genes too
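Checking a gene of interest against such a table could look like the sketch below. The thresholds come from the slide; the table layout, method labels, gene names and counts are invented for illustration and are not the published data.

```python
LOW, HIGH = 100, 1900    # thresholds quoted on the slide (1000 reads simulated)


def flag_bad_genes(counts_by_method):
    """counts_by_method: dict method -> dict gene -> estimated count.

    Returns the set of genes that at least one method places outside
    the range LOW..HIGH.
    """
    bad = set()
    for method_counts in counts_by_method.values():
        for gene, count in method_counts.items():
            if count < LOW or count > HIGH:
                bad.add(gene)
    return bad


# Hypothetical counts for three genes under two hypothetical methods.
toy_table = {
    "methodX": {"gene1": 998, "gene2": 0, "gene3": 1010},
    "methodY": {"gene1": 1003, "gene2": 950, "gene3": 2400},
}
print(sorted(flag_bad_genes(toy_table)))  # ['gene2', 'gene3']
```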

Our solution? MMGs
We believe there are some genes that cannot be accurately quantified by RNA-Seq
Multi-map groups: defined as groups of genes that reads consistently multi-map to
Data led rather than annotation led – however, we find that data leads back to annotation
We propose to analyse these genes as a “group” – look for differential expression at the level of the MMG
If we find differential expression, use a different tool (e.g. qPCR) to figure out which member is responsible
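One way to picture a data-led grouping is as connected components: link two genes whenever reads multi-map between them, then take each connected set as a group. The sketch below does this with a small union-find. It is an illustration of the idea, not the published MMG procedure, and `min_shared_reads` is a hypothetical stand-in for "consistently".

```python
from itertools import combinations


def multi_map_groups(read_to_genes, min_shared_reads=1):
    """Group genes linked by reads that multi-map between them.

    read_to_genes: dict read_id -> set of genes the read aligns to.
    """
    # Count how many reads each pair of genes shares.
    shared = {}
    for genes in read_to_genes.values():
        for a, b in combinations(sorted(genes), 2):
            shared[(a, b)] = shared.get((a, b), 0) + 1

    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]   # path halving
            g = parent[g]
        return g

    # Merge genes whose shared-read count reaches the threshold.
    for (a, b), n in shared.items():
        if n >= min_shared_reads:
            parent[find(a)] = find(b)

    groups = {}
    for g in list(parent):
        groups.setdefault(find(g), set()).add(g)
    return [grp for grp in groups.values() if len(grp) > 1]


# Hypothetical reads: G1-G2 and G2-G3 share reads, so all three form one group.
toy_reads = {
    "r1": {"G1", "G2"},
    "r2": {"G2", "G3"},
    "r3": {"G4"},
    "r4": {"G5", "G6"},
}
toy_groups = sorted(sorted(g) for g in multi_map_groups(toy_reads))
print(toy_groups)  # [['G1', 'G2', 'G3'], ['G5', 'G6']]
```

Counts can then be summed within each group and tested for differential expression at the group level.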

We do find signature in the MMGs
Robert C and Watson M (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease, Genome Biology, accepted

Robert C, Watson M. Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol. 2015 Sep 3;16:177.

PeakRescue

Rescuing reads
A family of tools exists that tries to “rescue” multi-mapped reads to the correct gene
Christelle Robert (and Shriram Bhosle) have written one which we think is very good: PeakRescue
Under review at Bioinformatics
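For readers unfamiliar with the idea, the simplest kind of rescue shares each multi-mapped read among its candidate genes in proportion to their uniquely mapped counts. The sketch below shows that generic scheme only. It is not PeakRescue's algorithm, which is not described in these slides, and all names and numbers are made up.

```python
def rescue_proportional(unique_counts, multi_reads):
    """Split each multi-mapped read among its candidate genes in proportion
    to their uniquely mapped counts (equal split if none have unique reads).

    unique_counts: dict gene -> count of uniquely mapped reads.
    multi_reads: iterable of tuples of candidate genes, one tuple per read.
    """
    rescued = dict.fromkeys(unique_counts, 0.0)
    for genes in multi_reads:
        total = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            weight = unique_counts.get(g, 0) / total if total else 1 / len(genes)
            rescued[g] = rescued.get(g, 0.0) + weight
    return {g: unique_counts.get(g, 0) + rescued[g] for g in rescued}


# Hypothetical data: four reads map to both A and B; A has 3x B's unique reads.
toy_rescued = rescue_proportional({"A": 30, "B": 10}, [("A", "B")] * 4)
print(toy_rescued)  # {'A': 33.0, 'B': 11.0}
```

The weakness of this scheme is visible in the equal-split fallback: when family members have few or no unique reads, there is little information to guide the split.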

The MinION

MinION: New USB sequencer
Good run: 35, Kb (mean) reads (2D: 90% identity)
Bad run: 7-10, Kb (mean) reads (2D: 90% identity)
Both produce much more (2-3x) 1D data at about
We are looking for collaborators

poRe
We were one of the first groups in the world to publish a MinION paper
poRe: an R package to help users store and analyse MinION data
Published in Bioinformatics
Funded by BBSRC (TRDF)

The MinION can finish genomes
Illumina + MinION now the cheapest way to finish bacterial genomes
~800, PE MiSeq
~7000 MinION reads
Commodity hardware, open source tools -> single chromosome with no gaps
Preprint in bioRxiv

Follow me: Twitter: @BioMickWatson Blog: biomickwatson.wordpress.com

Acknowledgements Funders: BBSRC, Roslin Foundation, TSB
People: Judith Risse, Mark Blaxter, Garry Blakely, Marian Thomson, Richard Talbot, Edinburgh Genomics, Christelle Robert, Shriram Bhosle Edinburgh Genomics: The Roslin Institute:
