SEQC Pipeline Comparison Using SEQC data to create performance metrics for RNA Quantification pipelines.



Overview
Used only 25 million reads per replicate
4 samples, with 4 technical replicates each (labeled A1, A2 through D3, D4)
– UHR (Sample A), Brain (Sample B)
– Two mixes of UHR & Brain (C & D), where C is 75% A and D is 25% A
Detection is arbitrarily defined as FPKM > 0.15
“Titration” results should therefore be in order…
– A>C>D>B
– A<C<D<B
– Titration numbers are reported as #genes titrated/#genes detected
For all analyses, reads were clipped & filtered using the fastq-mcf tool (default settings) prior to analysis
Summary tool is available to be run on an SEQC FPKM matrix at
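The detection and titration definitions above can be sketched in a few lines of Python. This is an illustration only, not the summary tool itself: the function name, the input layout (one array of per-gene mean FPKM per sample) and the rule that a gene must pass the threshold in all four samples to count as detected are assumptions.

```python
import numpy as np

DETECTION_FPKM = 0.15  # detection threshold stated on the slide


def titration_summary(fpkm):
    """fpkm: dict mapping 'A', 'B', 'C', 'D' to arrays of mean FPKM per gene."""
    a, b, c, d = (np.asarray(fpkm[s], dtype=float) for s in "ABCD")
    # Assumption: a gene is "detected" only if every sample exceeds the threshold.
    detected = ((a > DETECTION_FPKM) & (b > DETECTION_FPKM)
                & (c > DETECTION_FPKM) & (d > DETECTION_FPKM))
    # A gene "titrates" if the mixes fall in order between the pure samples.
    descending = (a > c) & (c > d) & (d > b)
    ascending = (a < c) & (c < d) & (d < b)
    titrated = detected & (descending | ascending)
    n_detected = int(detected.sum())
    n_titrated = int(titrated.sum())
    ratio = n_titrated / n_detected if n_detected else float("nan")
    return n_detected, n_titrated, ratio


# Toy matrix: four genes (hypothetical values).
example = {
    "A": [10.0, 1.0, 5.0, 0.1],
    "C": [8.0, 2.0, 3.0, 0.2],
    "D": [4.0, 3.0, 4.0, 0.3],
    "B": [2.0, 4.0, 1.0, 0.4],
}
n_detected, n_titrated, ratio = titration_summary(example)
```

In the toy data the first two genes titrate (one in each direction), the third is detected but out of order, and the fourth fails detection in Sample A.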

Pipelines Used
tophat2 & tophat1.4
– with the -G parameter, so it uses the same GTF file as other methods
– Tophat1 & 2 were both filtered for “FAIL” entries
– Tophat1 & 2 were roughly equal; only tophat1 is shown in some graphs because it was very slightly better on all metrics
– Bias correction algorithm
xprs
– using BWA alignment & the eXpress quantifier
– eXpress isoforms were filtered using the “F” flag. When this filtering was turned off, we generated a set labeled “xprsnf”
– Expectation maximization algorithm
rsem
– version 1.2.0, default parameters
– Expectation maximization algorithm
rsembwa
– version 1.2.0, using the same BWA alignments as xprs
refq
– using the same BWA alignments & counting genes & isoforms, weighted towards unambiguous alignments (1/sq(ambig)). This is used as a “naive” quantifier.

Titration & Detection Results
From the refq results, it’s clear that consistent, but possibly false, results can manifest as “high detection and high titration”. Therefore titration and detection are not, on their own, good indicators of a quality quantifier.
Good CVs can also be faked (as in the refq results) by a flat isoform distribution. These measures must be coupled with a measure of false positives (qPCR).
BWA improves the repeatability of RSEM
Tophat’s gene repeatability is good, but isoform repeatability is best in the EM estimators
Tophat1 & naïve counting had the best gene-level “titration+detection”
RSEM/eXpress has the best isoform-level titration and repeatability, with the exception of the known-false naïve reference counting
[Plots: titration and detection]
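For reference, a median CV across technical replicates could be computed as below. This is a minimal sketch: the function name, the use of the sample standard deviation and the optional `min_mean` filter are assumptions, not details taken from the slides.

```python
import numpy as np


def median_cv(replicates, min_mean=0.0):
    """replicates: 2-D array, rows = genes or isoforms, columns = technical replicates."""
    x = np.asarray(replicates, dtype=float)
    mean = x.mean(axis=1)
    sd = x.std(axis=1, ddof=1)   # sample standard deviation across replicates
    keep = mean > min_mean       # skip features with zero (or very low) mean
    return float(np.median(sd[keep] / mean[keep]))


# A quantifier that reports the same value in every replicate scores CV = 0,
# whether or not the value is right (hypothetical numbers).
flat_cv = median_cv([[5.0, 5.0, 5.0, 5.0], [2.0, 2.0, 2.0, 2.0]])
noisy_cv = median_cv([[8.0, 12.0, 10.0, 10.0], [2.0, 2.0, 2.0, 2.0]])
```

The flat example shows why a low CV on its own says nothing about accuracy, which is the point made about refq above.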

PCR – Slope & Correlation (one value per replicate-sample)
rsembwa and eXpress with BWA had the highest correlations, and were the only tools that beat naïve quantification for median correlation.
rsem -> rsembwa shows a large improvement in correlation, with some loss in slope
tophat’s results, while good in titration, were poor when compared with PCR: better than naïve on slope, but not on correlation
rsembwa and eXpress perform similarly (higher correlation, lower slope) when eXpress is unfiltered
tophat/cufflinks’s performance was not rescued when unfiltered (shown as th1nf)
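A hedged sketch of how one slope and one correlation per replicate-sample might be obtained from matched PCR and FPKM values. The Deming slope is the one named on the log-log slide; the log2 transform, the error-variance ratio of 1 and the function name are assumptions made for illustration.

```python
import numpy as np


def log_correlation_and_deming_slope(pcr, fpkm):
    """Pearson r and Deming slope of log2(fpkm) against log2(pcr).

    Assumes all inputs are strictly positive; zeros must be filtered beforehand.
    """
    x = np.log2(np.asarray(pcr, dtype=float))
    y = np.log2(np.asarray(fpkm, dtype=float))
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    r = sxy / np.sqrt(sxx * syy)
    # Deming regression with equal error variances in x and y (delta = 1).
    slope = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    return float(r), float(slope)


# Hypothetical values: FPKM is exactly twice the PCR signal.
r_same, slope_same = log_correlation_and_deming_slope([1, 2, 4, 8], [2, 4, 8, 16])
# Hypothetical values: FPKM is the square of the PCR signal (slope 2 in log space).
r_sq, slope_sq = log_correlation_and_deming_slope([2, 4, 8, 16], [4, 16, 64, 256])
```

Unlike ordinary least squares, the Deming fit allows for measurement error in both PCR and FPKM, so a method can have a high correlation and still show a flat slope.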

Pipeline Stability – Bland Altman
RSEM has stratification of very low FPKM values, because of rounding. Above a minimum mean value of 1, RSEM performed as well
Tophat1 performed very well according to this metric.
As a control, the naïve plots also performed very well, since they fail to distinguish isoforms well
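The quantities behind a Bland-Altman plot for two replicates can be sketched as follows. The log2(x + 1) transform, the function name and the way the minimum-mean filter is applied are assumptions; the slides give only the cutoff of 1.

```python
import numpy as np


def bland_altman(rep1, rep2, min_mean=1.0):
    """Return per-feature means and differences plus the bias and stdev of the differences."""
    a = np.asarray(rep1, dtype=float)
    b = np.asarray(rep2, dtype=float)
    keep = (a + b) / 2.0 >= min_mean   # drop very low FPKM features
    la = np.log2(a[keep] + 1.0)        # assumption: log2(x + 1) scale
    lb = np.log2(b[keep] + 1.0)
    mean = (la + lb) / 2.0             # x-axis of the plot
    diff = la - lb                     # y-axis of the plot
    return mean, diff, float(diff.mean()), float(diff.std(ddof=1))


# Hypothetical replicate pair; the last feature falls below the minimum mean.
mean, diff, bias, sd = bland_altman([1.0, 3.0, 7.0, 0.2], [3.0, 7.0, 15.0, 0.1])
```

The standard deviation of the differences is one plausible reading of the "bland-altman stdev" column in the metric summary.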

Isoform PCR - Log Log Plots (B1)
[Panels: xPress, tophat1, rsemBWA, refq (naïve), rsem]
When filtering matrices for “F” or “FALSE” out of eXpress and tophat, we improved titration, but reduced detection compared to naïve, resulting in zeroes
tophat had a lower correlation than the EM methods
Simple, naïve counting had a comparable correlation, but a flatter Deming slope than other methods

Isoform PCR - Log A-B Plots (Rep 1)
[Panels: xPressNF, th1nf, rsemBWA, refq (naïve), rsem]
These plots have “very low count” genes filtered out, to prevent “dividing by near zero” in the fold-change comparisons. These outliers will be tracked separately.
Naïve counting results in better correlated log ratios than other methods; this seems to indicate that the qPCR used may not be very isoform specific
eXpress, rsem and tophat/cufflinks performed similarly

Isoform PCR - Log C-D Plots (Rep 1)
[Panels: xPressNF, th1nf, rsemBWA, refq (naïve), rsem]
Since C & D are mixtures, and therefore closer, it was harder to match PCR on all methods
The range of signal differences in PCR should be 50% of A/B. eXpress, naïve and rsembwa matched that expectation
Again, naïve counting results in better correlated log ratios, and a poor slope
Again, unfiltered eXpress had the best slope, and BWA eXpress and rsem had the best log ratio correlations
Plot annotation: Undetected by other methods, possibly isoform ambiguity issue?
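The 50% expectation follows from the mixing proportions if the signal is assumed to mix linearly, which is a simplification. A short numeric check, with a hypothetical function name and made-up values:

```python
def expected_mix(a, b):
    """Expected signal in mixes C (75% A, 25% B) and D (25% A, 75% B).

    Assumption: the measured signal mixes linearly with the sample proportions.
    """
    c = 0.75 * a + 0.25 * b
    d = 0.25 * a + 0.75 * b
    return c, d


a, b = 8.0, 1.0            # hypothetical expression in Sample A and Sample B
c, d = expected_mix(a, b)
diff_ratio = (c - d) / (a - b)
```

On the linear scale C - D is always half of A - B, whatever the values of A and B. On a log-ratio scale the compression also depends on the size of the A/B fold change, so it is not a fixed 50% there.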

Metric Summary
Although no one method is the clear “winner” by all metrics, the BWA/EM methods have good PCR correlation, even in the more difficult C/D test, while maintaining low variability
The inclusion of a “naïve” metric (underlined) is an important control, highlighting the flaws in any one measure of quality
It’s important to measure quality using both isoform-level and gene-level metrics.
ISOFORM METRICS: iso titration pct, med detection count, med detection pct, median cv, bland-altman stdev, pcr correlation, pcr a/b correlation, pcr c/d correlation
GENE METRICS: titration pct, med detection count, med detection pct, median cv, bland-altman stdev
Rows: tophat, th1nf, rsem, rsembwa, xprs, xprsnf, refq
*top 10% are shaded, top 25% are bordered