SEQC Pipeline Comparison Using SEQC data to create performance metrics for RNA Quantification pipelines.


1 SEQC Pipeline Comparison
Using SEQC data to create performance metrics for RNA Quantification pipelines.

2 Overview
Used only 25 million reads per replicate
4 samples, with 4 technical replicates each (labeled A1, A2 through D3, D4)
- UHR (Sample A), Brain (Sample B)
- Two mixes of UHR & Brain (C & D), where C is 75% A and D is 25% A
Detection is arbitrarily defined as FPKM > 0.15
“Titration” results should therefore be in order…
- A>C>D>B
- A<C<D<B
- Titration numbers are reported as #genes titrated/#genes detected
For all analyses, reads were clipped & filtered using the fastq-mcf tool (default settings) prior to analysis
Summary tool is available to be run on an SEQC FPKM matrix at https://data1.expressionanalysis.com/seqc.cgi
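The detection and titration definitions above can be sketched in a few lines. This is only an illustration, not the summary tool's actual code: the function names are hypothetical, and the rule that a gene must exceed the threshold in all four samples to count as detected is an assumption, since the slide gives only the FPKM cutoff.

```python
DETECTION_FPKM = 0.15  # threshold stated on the slide


def is_detected(a, c, d, b, threshold=DETECTION_FPKM):
    # Assumption: "detected" means every sample exceeds the threshold.
    return min(a, c, d, b) > threshold


def is_titrated(a, c, d, b):
    # Mixture order from the slide: A>C>D>B or A<C<D<B.
    return a > c > d > b or a < c < d < b


def titration_fraction(genes):
    """genes maps a gene name to its (A, C, D, B) FPKM values."""
    detected = [v for v in genes.values() if is_detected(*v)]
    if not detected:
        return 0.0
    titrated = sum(1 for v in detected if is_titrated(*v))
    return titrated / len(detected)


# Made-up values for illustration only.
example = {
    "gene1": (8.0, 6.0, 4.0, 2.0),     # A>C>D>B
    "gene2": (1.0, 2.0, 3.0, 4.0),     # A<C<D<B
    "gene3": (5.0, 2.0, 6.0, 1.0),     # detected, but not monotone
    "gene4": (0.1, 0.05, 0.02, 0.01),  # below the detection threshold
}
fraction = titration_fraction(example)  # titrated / detected
```

Undetected genes are dropped from both the numerator and the denominator, matching "#genes titrated/#genes detected".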

3 Pipelines Used
tophat2 & tophat1.4
- with the -G parameter, so it uses the same GTF file as other methods
- Tophat1 & 2 were both filtered for “FAIL” entries.
- Tophat1 & 2 were roughly equal; only tophat1 is shown in some graphs because it was very slightly better on all metrics
- Bias correction algorithm
xprs
- using BWA alignment & the express-1.3.0 quantifier
- eXpress isoforms were filtered using the “F” flag. When this filtering was turned off we generated a set labeled “xprsnf”
- Expectation maximization algorithm
rsem
- version 1.2.0, default parameters
- Expectation maximization algorithm
rsembwa
- version 1.2.0, using the same BWA alignments as xprs
refq
- using the same BWA alignments & counting genes & isoforms that are weighted towards unambiguous alignments (1/sq(ambig)). This is used as a “naïve” quantifier.
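To make the refq idea concrete, here is a rough sketch of ambiguity-weighted counting. It is not the refq implementation: the function name and input layout are hypothetical, and reading "1/sq(ambig)" as one over the square of the number of targets a read hits is an assumption (the slide does not say whether "sq" means square or square root), so the weight is left as a replaceable parameter.

```python
from collections import defaultdict


def naive_weighted_counts(read_alignments, weight=lambda n: 1.0 / (n * n)):
    """Sum a per-read weight onto every target the read aligns to.

    read_alignments: iterable of lists, one list of target names per read.
    weight: maps the number of distinct targets hit by a read to its weight.
            Assumption: 1/n**2 is one reading of the slide's "1/sq(ambig)".
    """
    counts = defaultdict(float)
    for targets in read_alignments:
        unique = set(targets)
        if not unique:
            continue  # unaligned read
        w = weight(len(unique))
        for target in unique:
            counts[target] += w
    return dict(counts)


# Two unambiguous reads on isoA, one read shared between isoA and isoB.
reads = [["isoA"], ["isoA"], ["isoA", "isoB"]]
counts = naive_weighted_counts(reads)
```

Because a shared read still adds something to every target it touches, counts like these tend to be spread evenly across isoforms of the same gene.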

4 Titration & Detection Results
From the refq results, it’s clear that consistent, but possibly false, results can manifest as “high detection and high titration”. Titration and detection alone are therefore not sufficient indicators of a quality quantifier.
Good CVs can also be faked (as in the refq results) by a flat isoform distribution.
These measures must be coupled with a measure of false positives (qPCR).
BWA improves the repeatability of RSEM.
Tophat’s gene repeatability is good, but isoform repeatability is best in the EM estimators.
Tophat1 & naïve counting had the best gene-level “titration+detection”.
RSEM/Xprs had the best isoform-level titration and repeatability, with the exception of the known-false naïve reference counting.
[Plot panels: titration, detection (shown twice)]
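Repeatability here is summarized as a median CV over technical replicates. A minimal sketch of that summary follows, assuming the CV is the sample standard deviation divided by the mean and that all-zero features are skipped. Both choices are assumptions, and the function names are hypothetical.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(replicates):
    """CV of one feature across technical replicates, or None if the mean is 0."""
    m = mean(replicates)
    if m == 0:
        return None
    return stdev(replicates) / m


def median_cv(matrix):
    """matrix maps a feature name to its replicate FPKM values for one sample."""
    cvs = [coefficient_of_variation(row) for row in matrix.values()]
    return median([c for c in cvs if c is not None])


# Made-up replicate values for one sample (four technical replicates each).
sample_a = {
    "iso1": [10, 10, 10, 10],  # perfectly repeatable, CV = 0
    "iso2": [8, 12, 8, 12],    # noisy
    "iso3": [0, 0, 0, 0],      # never expressed, skipped
}
result = median_cv(sample_a)
```

A quantifier that assigns nearly identical values to every isoform of a gene would score well on this measure without being accurate, which is the caveat the slide raises.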

5 PCR - Slope & Correlation (one value per replicate-sample)
rsembwa and eXpress with BWA had the highest correlations and were the only tools that beat naïve quantification for median correlation.
rsem -> rsembwa shows a large improvement in correlation, with some loss in slope.
tophat’s results, while good in titration, were poor when compared with PCR: better than naïve on slope, but not on correlation.
rsembwa and eXpress perform similarly (higher correlation, lower slope) when eXpress is unfiltered.
tophat/cufflinks’s performance was not rescued when unfiltered (shown as th1nf).
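For readers who want to reproduce a slope and correlation pair for one replicate-sample, a minimal sketch follows. It assumes both measurements are log2-transformed and uses an ordinary least-squares slope. The slides do not state the transform, and the log-log slide reports a Deming slope, so treat this as an approximation of the idea rather than the method used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up values: FPKM tracks PCR perfectly but with a compressed range.
pcr = [1.0, 2.0, 4.0, 8.0]
fpkm = [2 ** (0.5 * k) for k in range(4)]

log_pcr = [math.log2(v) for v in pcr]
log_fpkm = [math.log2(v) for v in fpkm]

r = pearson(log_pcr, log_fpkm)        # close to 1
slope = ols_slope(log_pcr, log_fpkm)  # close to 0.5
```

The example shows why both numbers are reported: correlation can be near perfect while the slope reveals a compressed dynamic range.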

6 Pipeline Stability - Bland Altman
RSEM has stratification of very low FPKM values, because of rounding. Above a minimum mean value of 1, RSEM performed as well.
Tophat1 performed very well according to this metric.
As a control, the naïve plots also performed very well, since they fail to distinguish isoforms well.
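A Bland-Altman comparison of two replicates plots the difference of each pair against its mean, and the spread of the differences gives a single stability number. The sketch below is illustrative only: the function names are hypothetical, the comparison is done on raw (not log) values, and the minimum-mean filter of 1 is borrowed from the slide's remark about RSEM rather than from a documented procedure.

```python
from statistics import stdev


def bland_altman_points(rep1, rep2, min_mean=1.0):
    """Return (mean, difference) pairs for features whose mean is >= min_mean."""
    points = []
    for x, y in zip(rep1, rep2):
        m = (x + y) / 2
        if m >= min_mean:
            points.append((m, x - y))
    return points


def difference_stdev(points):
    """Sample standard deviation of the differences."""
    return stdev([d for _, d in points])


# Made-up FPKM values for the same four features in two replicates.
rep1 = [0.2, 10, 20, 30]
rep2 = [0.4, 12, 18, 30]

points = bland_altman_points(rep1, rep2)  # the first feature is filtered out
sd = difference_stdev(points)
```

A smaller standard deviation means the two replicates agree more closely across the expression range.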

7 Isoform PCR - Log Log Plots (B1)
[Plot panels: xPress, tophat1, rsemBWA, refq (naïve), rsem]
When filtering matrices for “F” or “FALSE” out of eXpress and tophat, we improved titration but reduced detection compared to naïve, resulting in zeroes.
tophat had a lower correlation than the EM methods.
Simple, naïve counting had a comparable correlation, but a flatter Deming slope than other methods.
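Deming regression differs from ordinary least squares in that it allows measurement error in both variables, which suits a comparison of two noisy assays. A minimal sketch using the standard closed-form slope is given below. The error-variance ratio `delta` defaults to 1 (orthogonal regression), which is an assumption, since the slides do not say what ratio was used.

```python
import math


def deming_slope(x, y, delta=1.0):
    """Closed-form Deming regression slope.

    delta is the ratio of the error variance in y to the error variance in x.
    Assumption: delta = 1, i.e. orthogonal regression.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    diff = syy - delta * sxx
    return (diff + math.sqrt(diff * diff + 4 * delta * sxy * sxy)) / (2 * sxy)


# Made-up log-scale values lying exactly on a line of slope 0.5.
log_pcr = [0.0, 1.0, 2.0, 3.0]
log_fpkm = [0.0, 0.5, 1.0, 1.5]
slope = deming_slope(log_pcr, log_fpkm)
```

A "flatter" slope in this sense means the quantifier's log values span a narrower range than the PCR values they are compared with.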

8 Isoform PCR - Log A-B Plots (Rep 1)
[Plot panels: xPressNF, th1nf, rsemBWA, refq (naïve), rsem]
These plots have “very low count” genes filtered out, to prevent “dividing by near zero” in the fold-change comparisons. These outliers will be tracked separately.
Naïve counting results in better correlated log ratios than other methods; this seems to indicate that the qPCR used may not be very isoform specific.
eXpress, rsem and tophat/cufflinks performed similarly.
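The low-count filter and log-ratio step can be sketched as follows. The cutoff of 1.0, the use of log2, and the rule that both samples must pass the cutoff are all assumptions, since the slide says only that “very low count” genes were removed. The function name is hypothetical.

```python
import math


def log_ratios(sample_a, sample_b, min_value=1.0):
    """Return (ratios, outliers) for features present in both samples.

    ratios: feature -> log2(A / B) for features where both values >= min_value.
    outliers: sorted names of features removed by the filter.
    """
    ratios, outliers = {}, []
    for name in sorted(sample_a.keys() & sample_b.keys()):
        a, b = sample_a[name], sample_b[name]
        if a >= min_value and b >= min_value:
            ratios[name] = math.log2(a / b)
        else:
            outliers.append(name)  # kept aside so they can be tracked separately
    return ratios, outliers


# Made-up FPKM values for replicate 1 of samples A and B.
a1 = {"iso1": 8.0, "iso2": 0.01, "iso3": 2.0}
b1 = {"iso1": 2.0, "iso2": 0.5, "iso3": 4.0}
ratios, outliers = log_ratios(a1, b1)
```

Without the filter, iso2 would contribute a log ratio driven almost entirely by noise near zero.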

9 Isoform PCR - Log C-D Plots (Rep 1)
[Plot panels: xPressNF, th1nf, rsemBWA, refq (naïve), rsem]
Since C & D are mixtures, and therefore closer, it was harder to match PCR on all methods.
The range of signal differences in PCR should be 50% of A/B. eXpress, naïve and rsembwa matched that expectation.
Again, naïve counting results in better correlated log ratios, and a poor slope.
Again, unfiltered eXpress had the best slope, and BWA eXpress and rsem had the best log ratio correlations.
[Plot annotation: Undetected by other methods, possibly isoform ambiguity issue?]
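The 50% expectation follows from the mixing proportions if signal mixes linearly. A short numeric check is sketched below, under two assumptions: the remainder of each mix is sample B, and signal is proportional to the mixing fraction. The slide states only the fraction of A in each mix.

```python
import math


def mix(a, b, frac_a):
    """Expected signal for a mixture containing frac_a of A and the rest B."""
    return frac_a * a + (1 - frac_a) * b


# Made-up expression values for one isoform in the pure samples.
a, b = 16.0, 4.0

c = mix(a, b, 0.75)  # sample C: 75% A
d = mix(a, b, 0.25)  # sample D: 25% A

linear_ab = a - b  # difference between the pure samples
linear_cd = c - d  # difference between the mixtures

log_ab = math.log2(a / b)
log_cd = math.log2(c / d)  # smaller in magnitude than log_ab
```

On the linear scale the C-D difference is exactly half the A-B difference. On the log scale the C/D ratio is also compressed relative to A/B, though not by a fixed factor, which is one reason the C/D comparison is the harder test.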

10 Metric Summary
Although no one method is the clear “winner” by all metrics, the BWA/EM methods have good PCR correlation, even in the more difficult C/D test, while maintaining low variability.
The inclusion of a “naïve” metric (underlines) is an important control, highlighting the flaws in any one measure of quality.
It’s important to measure quality using both isoform-level and gene-level metrics.

ISOFORM METRICS
pipeline  iso titration pct  med detection count  med detection pct  median cv  bland-altman stdev  pcr correlation  pcr a/b correlation  pcr c/d correlation
tophat1   57.03              41277.5              53.22              0.3063     0.779               0.7934           0.9215               0.5992
th1nf     57.72              42014.5              54.14              0.2966     0.699               0.7947           0.9268               0.6350
rsem      61.93              37234.0              47.97              0.3632     1.078               0.7874           0.9221               0.7008
rsembwa   59.91              44061.5              56.77              0.3282     0.935               0.8386           0.9193               0.6873
xprs      61.59              42490.0              55.57              0.3190     0.860               0.8198           0.9005               0.6841
xprsnf    55.63              44603.5              57.47              0.3534     0.778               0.8344           0.9199               0.7385
refq      77.47              57662.5              80.99              0.0981     0.289               0.7854           0.9530               0.8597
Q90       61.76              44332.5              57.12              0.3014     0.738               0.8365           0.9245               0.7196
Q75       61.17              43668.6              56.47              0.3094     0.778               0.8308           0.9220               0.6974

GENE METRICS
pipeline  titration pct  med detection count  med detection pct  median cv  bland-altman stdev
tophat1   83.50          19092.5              68.38              0.0466     0.543
th1nf     84.52          19264.5              68.98              0.0453     0.377
rsem      80.39          19487.0              69.75              0.0727     0.366
rsembwa   81.39          19294.0              69.06              0.0727     0.365
xprs      80.99          19011.5              68.05              0.0752     0.345
xprsnf    80.99          19011.5              68.05              0.0752     0.345
refq      81.49          19978.5              83.39              0.0749     0.357
Q90       84.01          19390.5              69.40              0.0460     0.345
Q75       82.97          19286.6              69.04              0.0531     0.350

*top 10% are shaded, top 25% are bordered
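The slide does not explain how the Q90 and Q75 rows were computed. The numbers are consistent with linearly interpolated percentiles taken over the six real pipelines (excluding the refq control), using the 90th/75th percentile where higher is better and the 10th/25th where lower is better. That reading is an inference from the table, not something the deck states. The sketch below checks it for two isoform columns.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def cutoffs(values, higher_is_better=True):
    """Return the (top 10%, top 25%) cutoffs for one metric column."""
    if higher_is_better:
        return percentile(values, 90), percentile(values, 75)
    return percentile(values, 10), percentile(values, 25)


# Isoform columns from the table, in the order tophat1, th1nf, rsem,
# rsembwa, xprs, xprsnf. Assumption: refq is left out of the cutoffs.
iso_titration_pct = [57.03, 57.72, 61.93, 59.91, 61.59, 55.63]
iso_median_cv = [0.3063, 0.2966, 0.3632, 0.3282, 0.3190, 0.3534]

titration_q90, titration_q75 = cutoffs(iso_titration_pct)
cv_q90, cv_q75 = cutoffs(iso_median_cv, higher_is_better=False)
```

If this reading is right, a pipeline's cell would be shaded when its value is at least as good as the Q90 cutoff and bordered when it is at least as good as the Q75 cutoff.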

