Analysing ChIP-Seq Data

Slides:

Advertisements

Similar presentations

Visualising and Exploring BS-Seq Data

Advertisements

Copyright © 2009 Pearson Education, Inc. Chapter 29 Multiple Regression.

Simon v2.3 RNA-Seq Analysis Simon v2.3.

Peter Tsai Bioinformatics Institute, University of Auckland

Section 4.2 Fitting Curves and Surfaces by Least Squares.

Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

1 Psych 5500/6500 The t Test for a Single Group Mean (Part 5): Outliers Fall, 2008.

SeqMonk tools for methylation analysis

Looking at data: relationships - Caution about correlation and regression - The question of causation IPS chapters 2.4 and 2.5 © 2006 W. H. Freeman and.

RNAseq analyses -- methods

Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.

RNA-Seq Analysis Simon V4.1.

Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.

Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman.

CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.

Correlation & Regression Chapter 15. Correlation It is a statistical technique that is used to measure and describe a relationship between two variables.

1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:

Review Lecture 51 Tue, Dec 13, Chapter 1 Sections 1.1 – 1.4. Sections 1.1 – 1.4. Be familiar with the language and principles of hypothesis testing.

Introduction to RNAseq

The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 4 Designing Studies 4.2Experiments.

Correlation They go together like salt and pepper… like oil and vinegar… like bread and butter… etc.

Investigate Variation of Chromatin Interactions in Human Tissues Hiren Karathia, PhD., Sridhar Hannenhalli, PhD., Michelle Girvan, PhD.

D/RS 1013 Data Screening/Cleaning/ Preparation for Analyses.

No reference available

A Framework and Methods for Characterizing Uncertainty in Geologic Maps Donald A. Keefer Illinois State Geological Survey.

HOMER – a one stop shop for ChIP-Seq analysis

Library QA & QC Day 1, Video 3

Plot type specific considerations

Misleading bioinformatics: Mistakes, Biases, Mis-interpretations and how to avoid them Festival of Genomics 2017 Course Exercise Material:

Differential Methylation Analysis

Simon v RNA-Seq Analysis Simon v

SeqMonk tools for methylation analysis

SeqMonk tools for methylation analysis

Simon v1.0 Motif Searching Simon v1.0.

What is a Hidden Markov Model?

Biases and their Effect on Biological Interpretation

Figure 1. Annotation and characterization of genomic target of p63 in mouse keratinocytes (MK) based on ChIP-Seq. (A) Scatterplot representing high degree.

RNA-Seq analysis in R (Bioconductor)

Tutorial 6 : RNA - Sequencing Analysis and GO enrichment

Understanding and Validating Experimental Expectations

S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.

Genome Wide Association Studies using SNP

The FASTQ format and quality control

Elementary Statistics

Essential Statistics (a.k.a: The statistical bare minimum I should take along from STAT 101)

Artefacts and Biases in Gene Set Analysis

CHAPTER 4 Designing Studies

Simon v ChIP-Seq Analysis Simon v

Stat 217 – Day 28 Review Stat 217.

ChIP-Seq Data Processing and QC

Exploring and Understanding ChIP-Seq data

Misleading Bioinformatics Mistakes, Biases, Mis-Interpretations and how to avoid them Festival of Genomics 2017.

Visualising and Exploring BS-Seq Data

Sampling and Sample Size Calculations

CHAPTER 4 Designing Studies

Adrien Le Thomas, Georgi K. Marinov, Alexei A. Aravin Cell Reports

CHAPTER 4 Designing Studies

Simon V Motif Searching Simon V

Artefacts and Biases in Gene Set Analysis

CHAPTER 4 Designing Studies

Exploring and Presenting Results

Molecular Convergence of Neurodevelopmental Disorders

CHAPTER 4 Designing Studies

CHAPTER 4 Designing Studies

Applying principles of computer science in a biological context

Fall Final Topics by “Notecard”.

Quantitative analyses using RNA-seq data

CHAPTER 4 Designing Studies

Differential Expression of RNA-Seq Data

Evaluation David Kauchak CS 158 – Fall 2019.

Presentation transcript:

Analysing ChIP-Seq Data Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2018-03

Data Creation and Processing Starting DNA Fragmented DNA ChIPped DNA Mapped BAM File FastQ Sequence File Sequence Library Filtered BAM File Exploration Analysis

Steps in Analysis Define enriched regions Quantitate Compare Based around features De-novo peak prediction Quantitate Corrections and Normalisation Compare Categorical Quantitative

Defining Regions - Should I peak call? You need a single set of reference positions for analysis Peak calling to define solely from the data Feature based measurements if your exploration showed linkage to features If exploration showed strong and reasonably complete feature association then this is a good option No worries about missing weaker peaks More complete background (get both enriched and unenriched) If no feature linkage then peak call Only looking at enriched regions More difficult to do functional interpretation later on

How Peak Callers Work (MACS) Optimise the starting data Build a background model Test sliding windows Report Apply per-site adjustment

Optimise the starting data Correct the for/rev offset Deduplicate

Build a background model Lambda value Observed

Build a background model Lambda value Critical p-value (n=18) Model

Build a background model Lambda value Critical p-value (n=18) Observed + Model

Test Sliding Windows Generally use half of the library fragment size Windows whose count exceeds the critical value are kept Merge adjacent windows over the critical value to form peaks Generates candidate (not final) peak set

Correct for local variation Critical value Generate localised model if input density is higher than the global value Most pessimistic p-value is kept

How should you apply peak callers Multiple ChIPs (over multiple conditions) Multiple Inputs

Multiple Inputs Input variability generally reflects general trends Mappability Genome Assembly Fragmentation biases Normally best to merge all inputs to one common reference input

Multiple ChIPs BAM Files Peak Sets WT ChIP 1 WT ChIP 2 KO ChIP 1 + WT ChIP 2 KO ChIP 1 KO ChIP 2 Peaks WT ChIP 1 + WT ChIP 2 KO ChIP 1 KO ChIP 2

Multiple ChIPs BAM Files Peak Sets WT ChIP 1 WT ChIP 2 KO ChIP 1 WT Peaks 1 WT Peaks 1 And WT Peaks 2 KO Peaks 1 KO Peaks 2 WT Peaks 1 And WT Peaks 2 Or KO Peaks 1 KO Peaks 2 WT Peaks 2 KO Peaks 1 KO Peaks 2

Multiple ChIPs WT Rep1 WT Rep2 WT Rep3 WT All KO Rep1 KO Rep1 KO Rep1 KO All Peaks All Final Peaks (dedup)

Why isn't a peak called Calling a peak is a combination of Degree of enrichment Behaviour of the background Total number of sequences

Why isn't a peak called Fewer peaks are called by just sub-sampling the same data

Why isn't a peak called With no input the region around the peak is used to model the background

Combining Peak sets Don’t make claims based solely on the number of peaks (“there were more WT peaks than KO peaks” for example) Don’t make claims based on regions being peaks in 1 set but not another (there were 465 peaks which were specific to KO) It is OK to make statements about overlap (there were 794 peaks which were common to WT and KO) You have to address differential enrichment problems quantitatively

Quantitating ChIP data for analysis Quantitation of ChIP is not a simple problem Can start with something simple but in many cases you will need to refine this. Simple linear, globally corrected counts are a good place to start

Cleaning and Evaluating Filtering on input As with the exploration we should remove measures with unusually high input signal Easy if all measurements are the same size For different size measures (eg gene bodies) make sure you calculate read density (not just count) so values are comparable

Cleaning and Evaluating Filtering on peak size Many ChIPs will produce characteristically sized peaks Look for peaks which fall outside the expected range ~1kb

Normalising to input? Do you have significant variability in your input If not then there's no need to normalise Check it's not just related to different peak sizes! Do you have ChIP signal which is correlated with the input level? Only if you see correlation between ChIP and input is normalisation to be considered Most of the time this isn't the case

See if the input has an influence For truly enriched regions the input level is not predictive of the ChIP level. Normalising to input would make things worse here.

Why not always do "fold over input"? Inputs are generally poorly measured Coverage over enriched areas will be low in the input Fold values more influenced by input than ChIP Biases in input are smaller than enrichment power of the antibody

Evaluating and Normalising Enrichment Normalised Read Count Percentile through data

Look for systematic enrichment changes (real biology!!)

Normalising Enrichment Simple Single point of reference (percentile, size factor etc) Works for small differences, not for large ones Enrichment specific Two points of reference Low percentile to reflect baseline High percentile to reflect close to saturation Add to match first, Multiply to match second Quick and Dirty Quantile normalisaiton to force a common distribution Don't normalise the input or use it to calculate distributions!

Normalising Enrichment

Checking Normalisation Before Normalisation After Normalisation

Differential enrichment analysis Needs to be quantitative Needs to operate on non-deduplicated data Two statistical options Count based stats on raw uncorrected counts DESeq EdgeR Continuous quantiation stats on normalised enrichment values LIMMA

Which statistic to pick? If baselines do not look obviously separated Raw counts, then DESeq/EdgeR If baselines have separated Enrichment normalisation LIMMA statistics

Validation of hits Map onto scatterplot for simple verification Check properties of peaks Sizes Strand bias Genomic location Look at the data underneath candidates you make specific claims about

Hit validation Linear View Log View Look whether hits make sense Look at points which change but were not selected Log scale can be useful for visualising hits Keep the context of non-hits

Hit validation Directionality Most ChIP enrichments are not strand-specific Should expect to see enrichment on both strands

Hit validation Hits shouldn't look different in unrelated metrics Strand Bias Plot Variance Plot Hits shouldn't look different in unrelated metrics Directionality Variance (within replicate group)

Hit validation You should be able to see consistency between replicates

Downstream Analyses

Composition / Motif Analysis Good place to start, can provide either biological or technical insight See if hits (up vs down) cluster based on the underlying sequence composition Motifs Great for defining putative binding sites Interesting to do sensitivity check Can do differential motif calling (for hit/non-hit)

Compter - composition analysis www.bioinformatics.babraham.ac.uk/projects/compter

MEME - Motif Analysis

Gene Ontology / Pathway Be careful how you relate hits to genes Really need to have a global link between peak positions and genes Random positions will give significant GO hits if you just use closest/overlapping genes

Experimental Design

Experimental Design Considerations All normal rules apply Think about sources of variation Don't confound variables Think about what batch effects might exist Test your antibody well before starting By far the biggest factor in success Good performance on Western / in-situ is not a guarantee, but it's a good start

Experimental Design Considerations Number of replicates Lots of studies use 2 replicates Fine for just finding binding sites (motif analysis) Not really enough for differential binding Huge reliance on 'information sharing' No accurate measurement of variance per peak Potentially over-predicts differential binding Should think about likely levels of variability and make replicates to match

Experimental Design Considerations Amount of sequencing Can be difficult to predict Depends on Genome size Proportion of genome which is enriched Efficiency of enrichment ENCODE standard is ~20M reads per sample Can get away with fewer (K4me3 for example) Will need more for some marks (H3 for example) Sequencing depth will affect ability to detect changes

Experimental Design Considerations Type of sequencing Single end is find for most applications ATAC-Seq can require paired end for some analyses Moderate read length is required Can map anywhere in the genome 50bp is probably OK. 100bp would be preferable

Material for the Course All Slides Exercises Data Virtual Machine Images Are available at http://www.bioinformatics.babraham.ac.uk/training.html