Why weight? Variance modelling for designed RNA-seq experiments

Abstract: Outlier samples are relatively common in RNA-seq experiments, and the root cause of such variation is generally unknown. In small experiments, the analyst faces a difficult decision: removing the offending sample may reduce variation, but at the cost of reduced power, which can limit our ability to detect biologically meaningful changes. A compromise is to use all of the available data but down-weight the observations from the outlier sample in the analysis. In this poster we describe a statistical approach that allows this by modelling heterogeneity at both the sample and observation level in the differential expression analysis. Using both simulations and real data, we tease apart scenarios where this strategy leads to a more powerful analysis. Our approach is implemented in the open-source limma package available from Bioconductor.
Matthew Ritchie, Walter + Eliza Hall Institute. ABiC, 11th October 2014

RNA-seq in genomics research
Use of high-throughput sequencing (‘next-’ or ‘second’-generation) technologies to study gene expression
Many applications: differential expression, transcript discovery, alternative splicing, allele-specific expression
Focus today is on adapting methods for differential expression analysis to work well on messy data sets
RNA-seq is becoming increasingly popular for the study of differential gene expression. The technology sequences short reads and aligns them back to the genome. The number of reads that map to a particular gene, exon, or other feature is recorded, giving data in the form of counts. ChIP-seq data also come in the form of counts, and although I’ll be talking about RNA-seq data, the methods mentioned may also apply to ChIP-seq data.
Table of counts
Gene ID A1 A2 B1 B2
ENSG 478 619 4830 7165
ENSG 27 20 48 55
ENSG 132 200 560 408
ENSG 42 60 131 99
ENSG 21 29 52 44
ENSG
ENSG 4 9 7
ENSG 30 23
ENSG 46 63 54 53
ENSG 2256 2793 2702 2976
… tens of thousands more …

A tale of two experiments…

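[Figure: plots of the samples in the two experiments, with outlier samples marked. Removing the outlier samples would discard 16% of the data in experiment 1 and 30% of the data in experiment 2.]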

Play the weighting game
High precision: higher weight
Low precision: lower weight
More variable observations are given lower weight in the differential expression analysis
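To make the weighting idea concrete, here is a minimal sketch of an inverse-variance weighted mean. It is an illustration only, not limma code: the function name `weighted_mean` and the numbers are made up, and the variances are assumed to be known.

```python
def weighted_mean(values, variances):
    """Inverse-variance weighted mean: precise observations count for more."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Three replicate log-expression values (made up); the third comes from a noisy sample.
values = [5.0, 5.2, 8.0]

# Equal variances reproduce the ordinary mean.
equal = weighted_mean(values, [1.0, 1.0, 1.0])

# Assuming the third sample is ten times more variable, it gets a tenth of the weight.
down_weighted = weighted_mean(values, [1.0, 1.0, 10.0])

print(equal, down_weighted)
```

The noisy value still contributes to the estimate, but it pulls the result much less than it does in the unweighted mean.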

Voom: Mean-Variance trend
Observation-level weights deal with the trend in variability observed as abundance changes.
Available in the limma package from Bioconductor
Law et al. Genome Biology (2014)
[Figure: sqrt(standard deviation) plotted against mean (log2 cpm), with a lowess line.]
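A rough sketch of how a mean-variance trend can be turned into observation-level weights is given below. It simplifies the voom method in several ways that are assumptions of this example: there is a single group of replicates rather than a fitted linear model, everything stays on the log2 cpm scale, and the lowess line is replaced by a hypothetical function `toy_trend`. The counts and library sizes are made up.

```python
import math
import statistics

def log2_cpm(counts, lib_sizes):
    """log2 counts per million, with the 0.5 and 1 offsets that keep zero counts finite."""
    return [[math.log2((c + 0.5) / (n + 1.0) * 1e6) for c, n in zip(row, lib_sizes)]
            for row in counts]

def mean_sqrt_sd(row):
    """One point on the mean-variance plot: (mean log2 cpm, sqrt of the standard deviation)."""
    return statistics.mean(row), math.sqrt(statistics.stdev(row))

def toy_trend(mean_logcpm):
    """Hypothetical stand-in for the fitted lowess line: smooth, positive, decreasing."""
    return 0.3 + math.exp(-0.3 * mean_logcpm)

def precision_weight(trend, mean_logcpm):
    """The trend predicts sqrt(sd), so the inverse variance is 1 / trend**4."""
    return 1.0 / trend(mean_logcpm) ** 4

# Illustrative counts for two genes in three replicate samples (made-up numbers).
counts = [[12, 30, 9], [2100, 2500, 2300]]
lib_sizes = [9_000_000, 11_000_000, 10_000_000]

for row in log2_cpm(counts, lib_sizes):
    mean, sqrt_sd = mean_sqrt_sd(row)
    print(round(mean, 2), round(sqrt_sd, 2), round(precision_weight(toy_trend, mean), 2))
```

Because the trend is read on the square-root standard deviation scale, raising the predicted value to the fourth power gives a variance, and its inverse is the weight. With a decreasing trend, low-abundance observations receive lower weights.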

Sample-specific weights
Assume: the RNA-seq experiment has some replication
How well does each sample agree with the others?
1. For each gene: compute Average(X), then Deviation = X - Average(X)
2. Average the deviations for each sample
3. Penalise samples for disagreeing with the others
Ritchie et al. BMC Bioinformatics (2006)
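The three steps can be sketched roughly as follows. This is a simplified illustration and not the published algorithm, whose variance model is more involved. Two choices here are assumptions of the sketch: deviations are squared before averaging (raw deviations would average out to about zero), and the penalty is the inverse of each sample's relative mean squared deviation. The name `sample_weights` and the data are made up.

```python
def sample_weights(expr):
    """expr: one row per gene, each row holding log-expression values for replicate samples."""
    n_samples = len(expr[0])
    sq_dev = [0.0] * n_samples
    for row in expr:
        avg = sum(row) / n_samples                  # step 1: Average(X)
        for j, x in enumerate(row):
            sq_dev[j] += (x - avg) ** 2             # step 1: squared Deviation
    mean_sq = [s / len(expr) for s in sq_dev]       # step 2: average over genes
    overall = sum(mean_sq) / n_samples
    # Step 3: samples that disagree more than average get a weight below 1.
    # Assumes no sample has a mean squared deviation of exactly zero.
    return [overall / m for m in mean_sq]

# Three genes, three replicate samples; the third sample is the noisy one.
expr = [[5.0, 5.1, 6.0],
        [3.0, 2.9, 2.0],
        [7.0, 7.1, 8.2]]
print(sample_weights(expr))
```

Note that the outlier also shifts each gene's average, so the well-behaved samples pick up some deviation too. This is one reason a proper model-based estimate is preferable to this direct calculation.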

Modified algorithm to allow block/group structure
For example, a weight can be estimated for a single low-precision sample on its own, while the remaining samples are grouped together and assigned a common, higher weight.
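One simple way to picture the block structure is to pool per-sample variance estimates within each block, so that all members of a block share one weight. The sketch below does this under stated assumptions: the per-sample variances are taken as given, pooling is a plain average, and the name `block_weights` and the numbers are hypothetical rather than taken from limma.

```python
def block_weights(sample_variances, blocks):
    """Pool variance estimates within each block so that block members share one weight."""
    totals, sizes = {}, {}
    for v, b in zip(sample_variances, blocks):
        totals[b] = totals.get(b, 0.0) + v
        sizes[b] = sizes.get(b, 0) + 1
    pooled = [totals[b] / sizes[b] for b in blocks]   # one pooled variance per sample
    overall = sum(pooled) / len(pooled)
    return [overall / p for p in pooled]              # inverse relative variance

# Six samples (made-up variances): five ordinary samples in one block, one outlier in its own.
variances = [0.9, 1.1, 1.0, 1.0, 1.0, 4.0]
blocks = ["rest"] * 5 + ["outlier"]
weights = block_weights(variances, blocks)
print(weights)
```

Fewer weights have to be estimated than in the fully sample-specific case, which may matter in small experiments.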

Simulating data with different fold-changes

Simulating data with outlier samples
Key: Group 1: samples 1, 2, 3. Group 2: samples 4, 5, 6. Outlier: sample 6.

Weighted analyses lead to fewer false discoveries
Take the top 200 genes from each simulation and tally up the number of false discoveries.
Key: 1. Voom weights 2. No weights 3. Sample weights 4. Block weights 5. Remove outlier
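The tally for one simulated data set can be sketched as below. The function name `false_discoveries`, the use of p-values for ranking, and the toy numbers are assumptions of this example. In a simulation the set of truly differentially expressed genes is known, which is what makes the count possible.

```python
def false_discoveries(p_values, truly_de, top_n=200):
    """Count genes that are not truly DE among the top_n genes ranked by p-value."""
    ranked = sorted(range(len(p_values)), key=lambda g: p_values[g])
    return sum(1 for g in ranked[:top_n] if g not in truly_de)

# Toy example with five genes, of which genes 0 and 3 are truly DE.
p_values = [0.001, 0.2, 0.003, 0.04, 0.5]
truly_de = {0, 3}
n_false = false_discoveries(p_values, truly_de, top_n=3)
print(n_false)
```

Repeating this for each analysis strategy and each simulated data set gives the tallies that are compared across methods.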

Weights improve power to detect differential expression
Key: 1. Voom weights 2. No weights 3. Sample weights 4. Block weights 5. Remove outlier

Better rankings for known genes regulated by Smchd1
Structural Maintenance of Chromosomes Hinge Domain containing 1 (Smchd1) has a role in X inactivation and in regulating monoallelic gene expression: genomic imprinting and regulation of the clustered protocadherins.
Gene set: Protocadherins
Methods compared: Voom, Sample Weights, Block, Remove Outlier
Experiment 1:
Experiment 2: 0.0581 0.0235 0.0707
Gene set testing using ROAST
Wu et al. Bioinformatics (2010)
Mould et al. Epigenetics & Chromatin (2013)
Gendrel et al. MCB (2013)

Better rankings for known genes regulated by Smchd1
Gene set: Imprinted genes
Methods compared: Voom, Sample Weights, Block, Remove Outlier
Experiment 1: 0.0231
Experiment 2: 0.0826 0.0232 0.0355
Gene set testing using ROAST
Wu et al. Bioinformatics (2010)
Mould et al. Epigenetics & Chromatin (2013)
Gendrel et al. MCB (2013)

Summary: why weight?
Simulations show that combining voom with sample-level weights gives better results: the fewest false positives and improved power.
Better results are also obtained on real RNA-seq data.
We allow flexibility in how the sample weights are assigned: sample-specific by default, with modelling of block/group-specific structure also possible (voomWithQualityWeights()).
The limma package can incorporate weights at each stage of the analysis (differential expression and gene set testing) to deliver a more powerful analysis.

Other Applications
Human tumor samples
Single cell RNA-seq

Acknowledgments
Aliaksei Holik, Marie-Liesse Asselin-Labat, Stephen Wilcox, Shalin Naik, Jessica Tran, Ben Kile, Catherine Carmichael
QIMR-Berghofer: Graham Kay
Funding
Cynthia Liu, Shian Su, Andy Chen, Gordon Smyth, Toby Sargeant, Natasha Jansz, Kelan Chen, Darcy Moore, Jamie Gearing, Huei San Leong, Marnie Blewitt

A data hack for medical research problems.
The weekend brings together software developers, user experience designers, data analysts and visualisers working directly with researchers to create better analysis tools. #healthhack
