Why weight? Variance modelling for designed RNA-seq experiments

Abstract: Outlier samples are relatively common in RNA-seq experiments and the root cause of such variation is generally unknown. In small experiments, the analyst is left with the difficult decision of what to do: removing the offending sample may reduce variation, but at the cost of reducing power, which can limit our ability to detect biologically meaningful changes. A compromise is to use all of the available data, but to down-weight the observations from the outlier sample in the analysis. In this poster we describe a statistical approach that allows this by modelling heterogeneity at both the sample and observational levels in the differential expression analysis. Using both simulations and real data, we tease apart scenarios where this strategy leads to a more powerful analysis. Our approach is implemented in the open-source limma package available from Bioconductor (http://www.bioconductor.org).

Matthew Ritchie
Walter and Eliza Hall Institute
ABiC, 11th October 2014
RNA-seq in genomics research
Use of high-throughput sequencing ('next-' or 'second'-generation) technologies to study gene expression
Many applications: differential expression, transcript discovery, alternative splicing, allele-specific expression
Focus today is on adapting methods for differential expression analysis to work well on messy data sets

RNA-sequencing is becoming increasingly popular for the study of differential gene expression. The RNA-seq technology sequences short reads and aligns them back to the genome. The number of reads that map to a particular gene, exon, or other feature is recorded, giving data in the form of counts. ChIP-seq data also come in the form of counts, and although I'll be talking about RNA-seq data, the methods mentioned may also apply to ChIP-seq data.

Table of counts
Gene ID            A1     A2     B1     B2
ENSG00000124208    478    619    4830   7165
ENSG00000182463    27     20     48     55
ENSG00000125835    132    200    560    408
ENSG00000125834    42     60     131    99
ENSG00000197818    21     29     52     44
ENSG00000125831
ENSG00000215443    4 9 7
ENSG00000222008    30 23
ENSG00000101444    46     63     54     53
ENSG00000101333    2256   2793   2702   2976
…
… tens of thousands more …
A tale of two experiments…
[Figure: the two experiments, with samples labelled "outlier" and "outlier?"]
Remove samples? That discards 16% of the data in expt 1 and 30% of the data in expt 2.
Play the weighting game
High precision: higher weight
Low precision: lower weight
More variable observations are given lower weight in the differential expression analysis
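To make the weighting idea concrete, here is a minimal sketch of an inverse-variance weighted mean. The expression values and variances are hypothetical, chosen only to show how a noisy replicate pulls the estimate less once it is down-weighted; this is not the limma implementation.

```python
def weighted_mean(values, variances):
    """Inverse-variance weighted mean: precise observations count for more."""
    weights = [1.0 / v for v in variances]
    total = sum(w * x for w, x in zip(weights, values))
    return total / sum(weights)


# Hypothetical log-expression values for one gene; the last replicate is noisy.
values = [5.0, 5.2, 4.8, 8.0]

# Equal variances reduce to the ordinary mean.
equal = weighted_mean(values, [1.0, 1.0, 1.0, 1.0])

# Giving the last replicate nine times the variance shrinks its influence.
down_weighted = weighted_mean(values, [1.0, 1.0, 1.0, 9.0])
```

With equal weights the noisy replicate drags the mean up to 5.75; with the lower weight the estimate moves back towards the three replicates that agree.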
Voom: Mean-Variance trend
Observational-level weights deal with the trend in variability observed as abundance changes.
Available in the limma package from Bioconductor
Law et al. Genome Biology (2014)
[Figure: mean-variance trend. x-axis: mean (log2 cpm); y-axis: sqrt(standard deviation); lowess line fitted through the points]
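The two quantities on the plot axes can be sketched as follows, assuming the log2 counts-per-million transformation with offsets of 0.5 and 1 described by Law et al. (2014), and assuming that the trend is fitted on the square-root standard deviation scale so that raising it to the fourth power gives a variance. The function names are hypothetical and the lowess fit itself is not reproduced; the trend value is passed in as a plain number.

```python
import math


def log2_cpm(count, lib_size):
    """log2 counts per million, with small offsets so zero counts stay finite."""
    return math.log2((count + 0.5) / (lib_size + 1.0) * 1e6)


def precision_weight(sqrt_sd):
    """Inverse of the predicted variance, given a trend value on the sqrt(sd) scale."""
    return 1.0 / sqrt_sd ** 4


# Hypothetical library size of 20 million reads for sample A1.
y = log2_cpm(478, 20_000_000)

# A hypothetical trend value read off the lowess line at that abundance.
w = precision_weight(0.8)
```

Observations that sit where the trend predicts more variability receive a smaller weight in the downstream linear model.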
Sample-specific weights
Assume: the RNA-seq experiment has some replication
How well does each sample agree with the others?
1. For each gene: compute Average(X), then Deviation = X - Average(X)
2. Average the Deviations for each sample
3. Penalise samples for disagreeing with others
Ritchie et al. BMC Bioinformatics (2006)
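The three steps above can be sketched as follows. This is a deliberately simplified illustration, not the estimator used in limma: it assumes all samples are replicates of a single condition, averages squared deviations (an assumption, since the slide does not say how deviations are combined), and turns them into weights by dividing the overall average variance by each sample's variance. A sample with zero variance would need special handling.

```python
from statistics import mean


def sample_weights(expr):
    """expr: list of genes, each a list of log-expression values, one per sample."""
    n_samples = len(expr[0])
    sq_dev = [[] for _ in range(n_samples)]

    # Step 1: per gene, deviation of each sample from the gene average.
    for gene in expr:
        centre = mean(gene)
        for j, x in enumerate(gene):
            sq_dev[j].append((x - centre) ** 2)

    # Step 2: average the squared deviations within each sample.
    sample_var = [mean(d) for d in sq_dev]

    # Step 3: samples that disagree more with the others get lower weight.
    overall = mean(sample_var)
    return [overall / v for v in sample_var]


# Hypothetical data: two genes, four replicates, the last one disagreeing.
weights = sample_weights([[5.0, 5.0, 5.0, 8.0],
                          [2.0, 2.0, 2.0, -1.0]])
```

In this toy input the first three samples each receive a weight of 3 and the disagreeing sample a weight of 1/3.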
Modified algorithm to allow block/group structure, e.g. one weight estimated for a single low-precision sample and another for the remaining samples, which share a higher (equal) weight
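A block version of the same idea can be sketched by pooling deviations within user-supplied blocks, so that every sample in a block shares one weight. As before, this is a simplified illustration under assumed conventions (squared deviations from the gene average, weights as a ratio of variances), and the block labels are hypothetical.

```python
from statistics import mean


def block_weights(expr, blocks):
    """expr: genes x samples log-expression values; blocks: one label per sample."""
    labels = sorted(set(blocks))
    sq_dev = {b: [] for b in labels}

    # Pool squared deviations from the gene average within each block.
    for gene in expr:
        centre = mean(gene)
        for x, b in zip(gene, blocks):
            sq_dev[b].append((x - centre) ** 2)

    block_var = {b: mean(sq_dev[b]) for b in labels}

    # Average variance across samples, so weights are relative to a typical sample.
    overall = mean(block_var[b] for b in blocks)
    return [overall / block_var[b] for b in blocks]


# Hypothetical data: the fourth sample forms its own low-precision block.
weights = block_weights([[5.0, 5.0, 5.0, 8.0],
                         [2.0, 2.0, 2.0, -1.0]],
                        ["rest", "rest", "rest", "outlier"])
```

Fewer variance parameters are estimated this way, which may help when the number of samples is small.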
Simulating data with different fold-changes
Simulating data with outlier samples
Key:
Group 1: 1, 2, 3
Group 2: 4, 5, 6
Outlier: 6
Weighted analyses lead to fewer false discoveries
Take the top 200 genes from each simulation and tally up the number of false discoveries
Key:
1. Voom weights
2. No weights
3. Sample weights
4. Block weights
5. Remove outlier
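The tally described above can be sketched as a count of top-ranked genes that are not in the set simulated as differentially expressed. The function name and gene labels are hypothetical; in a simulation the true set is known by construction.

```python
def false_discoveries(ranked_genes, true_de, top_n=200):
    """Count genes in the top_n of a ranking that are not truly differentially expressed."""
    top = ranked_genes[:top_n]
    return sum(1 for gene in top if gene not in true_de)


# Hypothetical ranking from one simulated data set, most significant first.
ranked = ["g1", "g7", "g2", "g9", "g3"]
truly_de = {"g1", "g2", "g3"}

n_false = false_discoveries(ranked, truly_de, top_n=4)
```

Here two of the top four genes (g7 and g9) are false discoveries; repeating the count over many simulated data sets gives the distributions compared across methods.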
Weights improve power to detect differential expression
Key:
1. Voom weights
2. No weights
3. Sample weights
4. Block weights
5. Remove outlier
Better rankings for known genes regulated by Smchd1

Protocadherins
                Voom      Sample Weights   Block      Remove Outlier
Experiment 1    0.00135   0.00021          0.00008    0.000175
Experiment 2    0.0581    0.00614          0.0235     0.0707

Structural Maintenance of Chromosomes Hinge Domain containing 1 (Smchd1) has a role in X inactivation and in regulating monoallelic gene expression.
Genomic imprinting and regulation of the clustered protocadherins
Gene set testing using ROAST: Wu et al. Bioinformatics (2010)
Mould et al. Epigenetics & Chromatin (2013)
Gendrel et al. MCB (2013)
Better rankings for known genes regulated by Smchd1

Imprinted genes
                Voom      Sample Weights   Block      Remove Outlier
Experiment 1    0.0231    0.000105         0.000115   0.00067
Experiment 2    0.0826    0.00817          0.0232     0.0355

Gene set testing using ROAST: Wu et al. Bioinformatics (2010)
Mould et al. Epigenetics & Chromatin (2013)
Gendrel et al. MCB (2013)
Summary: why weight?
Simulations show that combining voom with sample-level weights gives better results, with the fewest false positives and improved power
Better results are also obtained on real RNA-seq data
We allow flexibility in how the sample weights are assigned: sample-specific by default, with modelling of block/group-specific structure also possible (voomWithQualityWeights())
The limma package can incorporate weights at each stage of the analysis (differential expression and gene set testing) to deliver a more powerful analysis
Other Applications
Human tumor samples
Single cell RNA-seq
Acknowledgments Aliaksei Holik Marie-Liesse Asselin-Labat Stephen Wilcox Shalin Naik Jessica Tran Ben Kile Catherine Carmichael QIMR-Berghofer Graham Kay Funding Cynthia Liu Shian Su Andy Chen Gordon Smyth Toby Sargeant Natasha Jansz Kelan Chen Darcy Moore Jamie Gearing Huei San Leong Marnie Blewitt