Artefacts and Biases in Gene Set Analysis

Slides:



Advertisements
Similar presentations
Microarray statistical validation and functional annotation
Advertisements

RNA-Seq as a Discovery Tool
RIP – T RANSCRIPT E XPRESSION L EVELS. O UTLINE RNA Immuno-Precipitation (RIP) NGS on RIP & its alternatives Alternate splicing Transcription as a graph.
Visualising and Exploring BS-Seq Data
D ISCOVERING REGULATORY AND SIGNALLING CIRCUITS IN MOLECULAR INTERACTION NETWORK Ideker Bioinformatics 2002 Presented by: Omrit Zemach April Seminar.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
Investigating the Importance of non-coding transcripts.
More regulating gene expression. Fig 16.1 Gene Expression is controlled at all of these steps: DNA packaging Transcription RNA processing and transport.
AP Biology Control of Eukaryotic Genes.
Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.
Regulation of Gene Expression Eukaryotes
RNA-Seq Analysis Simon V4.1.
Verna Vu & Timothy Abreo
Presenting Results Laura Biggins v1.0 1.
AP Biology Control of Eukaryotic Genes.
AP Biology Control of Eukaryotic Genes.
Genomics I: The Transcriptome RNA Expression Analysis Determining genomewide RNA expression levels.
No reference available
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Extracting Biological Information from Gene Lists
Differential Methylation Analysis
Simon v RNA-Seq Analysis Simon v
(3) Gene Expression Gene Expression (A) What is Gene Expression?
Pathway Informatics 16th August, 2017
High-throughput data used in bioinformatics
Simon v1.0 Motif Searching Simon v1.0.
Biases and their Effect on Biological Interpretation
Controlling the genes Lecture 15 pp
Gene expression from RNA-Seq
Control of Gene Expression
RNA-Seq analysis in R (Bioconductor)
Tutorial 6 : RNA - Sequencing Analysis and GO enrichment
Detect alternative splicing
Understanding and Validating Experimental Expectations
The Basics of Molecular Biology
Primer design.
The FASTQ format and quality control
2/23/15 Learning Objectives
Exploratory Allele Specific Expression Analysis on X chromosome
Analysing ChIP-Seq Data
The Omics Dashboard Suzanne Paley Pathway Tools Workshop 2018
Artefacts and Biases in Gene Set Analysis
Behaviorally dependent allele-specific expression.
Characterization of microRNA transcriptome in tumor, adjacent, and normal tissues of lung squamous cell carcinoma  Jun Wang, MD, PhD, Zhi Li, MD, PhD,
Eukaryote Regulation and Gene Expression
Altered microRNA expression in stenoses of native arteriovenous fistulas in hemodialysis patients  Lei Lv, MD, Weibin Huang, MD, Jiwei Zhang, MD, Yaxue.
Exploring and Understanding ChIP-Seq data
Proteomics Informatics David Fenyő
Corona cell RNA sequencing from individual oocytes revealed transcripts and pathways linked to euploid oocyte competence and live birth  Jason C. Parks,
Control of Eukaryotic Genes
Visualising and Exploring BS-Seq Data
Protein Occupancy Landscape of a Bacterial Genome
What is the “big picture” question?
Gene expression and regulation & Mutations
Simon V Motif Searching Simon V
Working with RNA-Seq Data
MRNAs that show increased protein synthesis following UPF1 depletion are enriched for those with very long 3′UTRs. mRNAs that show increased protein synthesis.
Gene Expression Analysis
Differential expression levels of archaella operon genes and respiratory chain genes in Δsaci_pp2a and Δsaci_ptp. Differential expression levels of archaella.
Identification of the Transcriptional Targets of FOXP2, a Gene Linked to Speech and Language, in Developing Human Brain  Elizabeth Spiteri, Genevieve.
The Omics Dashboard.
Predicting Gene Expression from Sequence
Experimental summary Norwich DMSO treated versus Control Yeast
Cell protein production
Comparison of N. gonorrhoeae gene expression in infected men in vivo and isolates grown in vitro. Comparison of N. gonorrhoeae gene expression in infected.
Sequence Analysis - RNA-Seq 1
Data Type 1: Microarrays
Differential Expression of RNA-Seq Data
Presentation transcript:

Artefacts and Biases in Gene Set Analysis Simon Andrews, Laura Biggins, Christel Krueger simon.andrews@babraham.ac.uk laura.biggins@babraham.ac.uk christel.krueger@babraham.ac.uk v2019-02

What does gene set enrichment test? Is a functional gene set enriched for genes in my hit list compared to a background set Are some genes more likely to turn up in the hits for technical reasons? Are some genes never likely to turn up in the hit list for technical reasons?

Biases All datasets contain biases Technical Biological Statistical Biases can lead to incorrect conclusions We should be trying to spot these Some are more obvious than others!

Technical Biases Simple GC bias from different polymerases in PCR

Statistical Biases The power to detect a significant effect is based on: How big the change is How well observed the data is (sample size) Lists of hits are often biased based on statistical power

RNA-Seq Statistical Biases What determines whether a gene is identified as significantly differentially regulated? The amount of change (fold change) The variability How well observed was it How much sequencing was done overall? How highly expressed was the gene? How long was the gene? How mappable was the gene?

RNA-Seq Statistical Biases Unlikely to ever see hits from genes which are Lowly expressed Short

Biological Biases

Biases Look Like Real Biology

What can you do? Think about whether you’re likely to have expected biases in your experiment. If possible, restructure to avoid the bias Look for unexpected biases. Sometimes the bias is the interesting biology Use custom backgrounds during Gene Set Analysis to help minimise bias.

Correct selection of a background list can make a huge difference What genes were you likely to see? Some are technically impossible Membrane proteins in LC-MS Small-RNA in RNA-Seq Some are much less likely Unexpressed or low expressed in RNA-Seq Unmappable in ChIP-Seq Low CpG content in BS-Seq Make a list of what you could have seen, and set that as the background.

Expressed Genes Log2 Read Counts per Gene 26,127 Genes Measured

Expressed Genes 10,378 Genes Realistically Measured <128 (2^7) reads per gene Eye Development (p 8e-4) Log2 Read Counts per Gene 10,378 Genes Realistically Measured

Statistical biases affect gene sets too Fisher’s test is powered by Magnitude of change Observation level Big lists have more power to detect change Small lists are very difficult to detect Always look at the enrichment and the p-value when deciding what is interesting

Fold Change and p-value

Other biases: Relating Hits to Genes Most functional analysis is done at the gene level Gene Ontology Pathways Interactions Many hits are not gene based

Other biases: Random Genomic Positions Find closest gene Synapse, Cell Junction, postsynaptic membrane (p=8.9e-12) Membrane (p=4.3e-13) Glycoprotein (p=1.3e-12) Find overlapping genes Plekstrin homology domain (p=1.8e-7) Ion transport (p=7.1e-7) ATP-binding (p=3.8e-8)

Other biases: Random Transcripts Tends to favour genes with more splice variants Metal Binding, Zinc Finger (p=4.4e-12) Nucleus, Transcription Regulation (p=2.4e-14)

Stuff which turns up more than it should… Did a trawl through GEO RNA-Seq datasets Downloaded pairs of samples which are supposed to be biological replicates Found changing genes Ran GO searches Many gene sets give hits. Some categories turn up very often Ribosomal Cytoskeleton Extracellular Secreted Translation

Hit Validation Do my hits look different from non-hits in factors which should be unrelated Sequence composition Genomic position Gene Length Number of splice variants etc If a bias exists then is this the actual link between genes? If not then can I fix this by improving my background list?

Check for unrelated factors http://www.bioinformatics.babraham.ac.uk/shiny/gene_screen/

Check for unrelated factors Compter Sequence kmer analysis Does composition explain my hits? www.bioinformatics.babraham.ac.uk/projects/compter