University of Edinburgh


University of Edinburgh Beware the immeasurable: the genes RNA-Seq cannot accurately quantify and their role in human disease (plus a few unrelated slides about MinION) Mick Watson Edinburgh Genomics & The Roslin Institute University of Edinburgh

Are microarrays dead?

Submissions to NCBI GEO by technology GEO submissions will lag behind trends! Not all RNA-Seq ends up in GEO, some goes to SRA Microarrays used in clinical trials will never be submitted publicly

Microarray design What is the first step in microarray design? We find unique regions of the genes we want to put on the array Why do we do that? Because different genes often have high sequence homology to one another Why do we think we don’t need to do the same for RNA-Seq?

How you think RNA-Seq works Add 1 to counts table RNA-Seq pair Align to genome; overlaps an exon The reality is very different….

Consider a paired-end read Read1: can align in 0, 1 or many locations (3 outcomes) Read2: can align in 0, 1 or many locations (9 outcomes) Read1 alignments can overlap 0, 1 or many genes (27 outcomes) Read2 alignments can overlap 0, 1 or many genes (81 outcomes) Those genes may be the same gene or different genes (162 outcomes) The reads may be on the same strand or different strands (324 outcomes) Some of those outcomes are mutually exclusive In reality we end up with 193 possible outcomes Only 49 outcomes represent “one read, one gene” model RNA-Seq software tools do not model all of those outcomes correctly!!
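The naive outcome count on this slide can be reproduced by enumeration. The sketch below is illustrative only: the category labels are assumptions, and the single exclusion rule shown (a read with no alignment cannot overlap a gene) is just one example of a mutually exclusive combination. It is not the rule set that yields the 193 outcomes quoted above.

```python
from itertools import product

# Each read can align to 0, 1 or many locations, and its alignments
# can overlap 0, 1 or many genes (labels are illustrative).
CATEGORIES = ("0", "1", "many")


def naive_outcomes():
    """Enumerate every combination before removing impossible ones."""
    return list(product(
        CATEGORIES,                          # read 1: alignment locations
        CATEGORIES,                          # read 2: alignment locations
        CATEGORIES,                          # read 1: genes overlapped
        CATEGORIES,                          # read 2: genes overlapped
        ("same gene", "different genes"),
        ("same strand", "different strands"),
    ))


def drop_unaligned_with_genes(outcomes):
    """Example exclusion rule (an assumption, not the authors' full rule set):
    a read with 0 alignments cannot overlap any gene."""
    kept = []
    for r1_loc, r2_loc, r1_genes, r2_genes, gene_rel, strand_rel in outcomes:
        if r1_loc == "0" and r1_genes != "0":
            continue
        if r2_loc == "0" and r2_genes != "0":
            continue
        kept.append((r1_loc, r2_loc, r1_genes, r2_genes, gene_rel, strand_rel))
    return kept


all_outcomes = naive_outcomes()
print(len(all_outcomes))                              # 3 * 3 * 3 * 3 * 2 * 2 = 324
print(len(drop_unaligned_with_genes(all_outcomes)))   # fewer than 324
```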

How big a problem is this? Used 50 bp single-end (50SE) RNA-Seq to analyse 5 different cell populations in a mouse lung cancer model Choi H, Sheng J, Gao D, Li F, Durrans A, Ryu S, Lee SB, Narula N, Rafii S, Elemento O, Altorki NK, Wong STC, Mittal V: Transcriptome Analysis of Individual Stromal Cell Populations Identifies Stroma-Tumor Crosstalk in Mouse Lung Cancer Model. Cell Reports 2015, 10(7):1187-1201.

How big a problem is this? We analysed the data using STAR to align the reads and htseq-count (in union mode, -m union) to assign reads to genes (how you think RNA-Seq works)

Our work Took core human GRCh38 chromosomes. Extracted longest single transcript for each protein-coding gene. Removed short transcripts (< 400bp). Simulated 1000 perfect 100PE reads from each transcript, quantified using 12 different pipelines.
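A minimal sketch of simulating error-free paired-end reads from one transcript. The slides do not say which simulator or fragment size was used, so the fixed fragment length of 300 bp, the uniform start positions and the function names are all assumptions made for illustration.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pairs(transcript, n_pairs, read_len=100, fragment_len=300, seed=0):
    """Draw error-free read pairs from fragments placed uniformly along
    the transcript. fragment_len is a hypothetical fixed insert size."""
    if len(transcript) < fragment_len:
        raise ValueError("transcript shorter than fragment length")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        start = rng.randint(0, len(transcript) - fragment_len)
        fragment = transcript[start:start + fragment_len]
        read1 = fragment[:read_len]                        # forward read
        read2 = reverse_complement(fragment[-read_len:])   # reverse read
        pairs.append((read1, read2))
    return pairs


toy_transcript = "ACGT" * 150          # 600 bp toy sequence, not real data
pairs = simulate_pairs(toy_transcript, n_pairs=1000)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```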

Correlation of expected vs observed FPKM

HTSeq based methods Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166-9.

Should you use HTSeq? Not if you care about gene families Note: HTSeq immediately and without reservation throws out multi-mapped reads This is a deliberate “feature” of the software There are likely to be similar problems with other count-based methods
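To see why discarding multi-mapped reads hurts gene families, consider this toy counter. It is not HTSeq code, only a sketch of the "unique reads only" policy; the gene names and read assignments are hypothetical.

```python
from collections import Counter


def count_unique_only(read_alignments):
    """read_alignments maps a read id to the list of genes hit by each of
    its alignments. Reads with more than one alignment are discarded."""
    counts = Counter()
    discarded = 0
    for read_id, genes in read_alignments.items():
        if len(genes) == 1:
            counts[genes[0]] += 1
        else:
            discarded += 1      # multi-mapped: thrown away entirely
    return counts, discarded


# Hypothetical example: FAMILY_A1 and FAMILY_A2 are near-identical paralogs,
# so most of their reads align to both loci.
toy = {
    "r1": ["SINGLETON"],
    "r2": ["SINGLETON"],
    "r3": ["FAMILY_A1", "FAMILY_A2"],
    "r4": ["FAMILY_A1", "FAMILY_A2"],
    "r5": ["FAMILY_A1"],
}
counts, discarded = count_unique_only(toy)
print(dict(counts), discarded)
```

The singleton gene keeps all of its reads, while the paralogs lose every read that cannot be placed uniquely, so their counts are under-estimated.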

Cufflinks based methods Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511-5.

Cufflinks, FPKM and “effective length” FPKM: fragments per kilobase per million Take the count of reads overlapping a gene/transcript Divide by the length of the transcript in Kb (because longer genes will have more reads) Divide by the size of the library in millions Who decides the length of the gene? It’s not you! Cufflinks does this by default, using “effective length”
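The FPKM recipe above written out as a small function. This is a sketch of the textbook definition using whatever length is passed in; it is not Cufflinks' implementation, and the function name is an assumption.

```python
def fpkm(fragment_count, length_bp, library_size):
    """Fragments per kilobase of transcript per million fragments."""
    per_kilobase = fragment_count / (length_bp / 1_000)
    return per_kilobase / (library_size / 1_000_000)


# Same fragment count and library size, different lengths:
# the shorter transcript gets the higher FPKM.
for length in (400, 2_000, 10_000):
    print(length, round(fpkm(1_000, length, 19_650_000), 2))
```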

Understanding the scatterplot…. We simulated the same number of reads for all genes The library size is the same – 19.65M Therefore, FPKM is only defined by the length of the gene i.e. in the plot to the left, short genes have high FPKM i.e. Cufflinks is over-estimating the FPKM of short genes (“it is known”)

Fixing the scatterplot Note: we simulated reads from along the entire length of transcripts i.e. there is no effective length. Actual length == effective length We can turn effective length off in cufflinks (--no-effective-length-correction) So why is Cufflinks messing up the effective length of short transcripts? Our theory: Short transcripts/exons can be hard to map to (longer reads may exacerbate this!) If exons aren’t mapped to, they will shorten the “effective length”
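A rough illustration of why a shortened effective length inflates the FPKM of short transcripts. It assumes the common simple approximation effective length = length - mean fragment length + 1; Cufflinks itself uses a fragment-length distribution, and the 250 bp mean fragment length here is a hypothetical value.

```python
def effective_length(length_bp, mean_fragment_bp):
    """Simple approximation (an assumption, not Cufflinks' exact model):
    the number of positions where a mean-length fragment can start."""
    return max(length_bp - mean_fragment_bp + 1, 1)


def inflation_factor(length_bp, mean_fragment_bp):
    """How much larger FPKM becomes when the effective length replaces the
    actual length in the denominator."""
    return length_bp / effective_length(length_bp, mean_fragment_bp)


MEAN_FRAGMENT = 250   # hypothetical
for length in (400, 1_000, 5_000):
    print(length,
          effective_length(length, MEAN_FRAGMENT),
          round(inflation_factor(length, MEAN_FRAGMENT), 2))
```

The relative correction is largest for the shortest transcripts, which matches the direction of the over-estimation seen in the scatterplot.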

Sailfish Sailfish builds a database of k-mers from known transcripts No “mapping” – estimates expression directly from the reads using the k-mer index Incredibly fast Bias correction hasn’t worked Over-estimated gene is GAGE2E Sailfish estimates over 8000 reads for this gene A member of the GAGE gene family implicated in a number of cancers
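One way to picture why k-mer based methods can struggle with gene families: near-identical paralogs share most of their k-mers, so a k-mer seen in a read rarely points to a single transcript. The sketch below only measures k-mer sharing between two toy sequences; it is not Sailfish's algorithm, and the sequences are invented.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(seq_a, seq_b, k):
    """Fraction of seq_a's distinct k-mers that also occur in seq_b."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a)


# Two toy "paralogs" differing only at the last base.
paralog_1 = "ACGTACGTTAGC"
paralog_2 = "ACGTACGTTAGG"
print(shared_kmer_fraction(paralog_1, paralog_2, k=4))
```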

Kallisto Preprint came out after our work so not included in paper Builds de Bruijn graph from transcripts No alignment Super fast About 50 or so genes it gets (badly) wrong

Bad genes Data from all 12 methods for 19654 protein coding genes are available Use this to check your “favourite” genes and how accurate the methods are! Of 19654 genes, 958 were assigned counts below 100 or above 1900 by at least one method Errors dominated by HTSeq Both Cufflinks and Sailfish over- and under- estimate many genes too
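The "bad gene" filter above (1000 simulated read pairs expected per gene, flagged if any method reports fewer than 100 or more than 1900) can be sketched as follows. The data layout, function name, method names and counts are hypothetical, not the published data.

```python
def flag_bad_genes(counts_by_method, low=100, high=1900):
    """counts_by_method: {method_name: {gene: estimated_count}}.
    Returns {gene: [methods whose estimate is < low or > high]}."""
    flagged = {}
    for method, counts in counts_by_method.items():
        for gene, count in counts.items():
            if count < low or count > high:
                flagged.setdefault(gene, []).append(method)
    return flagged


# Hypothetical estimates for three genes from two methods.
toy = {
    "method_a": {"GENE_1": 1000, "GENE_2": 50, "GENE_3": 1010},
    "method_b": {"GENE_1": 990, "GENE_2": 1000, "GENE_3": 8000},
}
print(flag_bad_genes(toy))
```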

Our solution? MMGs We believe there are some genes that cannot be accurately quantified by RNA-Seq Multi-map groups: defined as groups of genes that reads consistently multi-map to Data led rather than annotation led; however, we find that data leads back to annotation We propose to analyse these genes as a “group” – look for differential expression at the level of the MMG If we find differential expression, use a different tool (e.g. qPCR) to figure out which member is responsible
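One possible way to derive multi-map groups from data is to link any two genes that share multi-mapped reads and take the connected components. This is a sketch under that assumption; the published method may define "consistently" differently, and the min_shared_reads threshold is a hypothetical parameter.

```python
from collections import defaultdict


def multi_map_groups(read_hits, min_shared_reads=1):
    """read_hits: iterable of gene lists, one list per read.
    Genes sharing at least min_shared_reads reads are merged into a group."""
    pair_counts = defaultdict(int)
    for genes in read_hits:
        unique = sorted(set(genes))
        for i in range(len(unique)):
            for j in range(i + 1, len(unique)):
                pair_counts[(unique[i], unique[j])] += 1

    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for (a, b), n in pair_counts.items():
        if n >= min_shared_reads:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical reads: A/B/C behave like one family, D/E like another,
# and F is only ever hit uniquely so it joins no group.
toy_reads = [["A", "B"], ["B", "C"], ["D", "E"], ["F"]]
print(multi_map_groups(toy_reads))
```

Counts can then be summed per group and tested for differential expression at the group level, as proposed above.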

We do find a signature in the MMGs Robert C and Watson M (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease, Genome Biology, accepted

Robert C, Watson M. Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol. 2015 Sep 3;16:177. doi: 10.1186/s13059-015-0734-x. PubMed PMID: 26335491; PubMed Central PMCID: PMC4558956.

PeakRescue

Rescuing reads A family of tools exists that tries to “rescue” multi-mapped reads to the correct gene Christelle Robert (and Shriram Bhosle) have written one which we think is very good: PeakRescue Under review at Bioinformatics

The MinION

MinION: New USB sequencer Good run: 35,000 6.5Kb (mean) reads (2D: 90% identity) Bad run: 7-10,000 6.5Kb (mean) reads (2D: 90% identity) Both produce much more (2-3x) 1D data at about We are looking for collaborators

poRe We were one of the first groups in the world to publish a MinION paper poRe: an R package to help users store and analyse MinION data Published in Bioinformatics Funded by BBSRC (TRDF)

The MinION can finish genomes Illumina + MinION now the cheapest way to finish bacterial genomes ~800,000 240PE MiSeq ~7000 MinION reads Commodity hardware, open source tools -> single chromosome with no gaps Preprint in bioRxiv

Follow me: Twitter: @BioMickWatson Blog: biomickwatson.wordpress.com

Acknowledgements Funders: BBSRC, Roslin Foundation, TSB People: Judith Risse, Mark Blaxter, Garry Blakely, Marian Thomson, Richard Talbot, Edinburgh Genomics, Christelle Robert, Shriram Bhosle Edinburgh Genomics: http://genomics.ed.ac.uk The Roslin Institute: http://www.roslin.ed.ac.uk