Exploring and Understanding ChIP-Seq data

Slides:

Advertisements

Similar presentations

Visualising and Exploring BS-Seq Data

Advertisements

SeqMonk tools for methylation analysis

Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.

RNA-Seq Analysis Simon V4.1.

Presenting Results Laura Biggins v1.0 1.

28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.

No reference available

A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for Health.

HOMER – a one stop shop for ChIP-Seq analysis

CHAPTER 10 DATA EXPLORATION 10.1 Data Exploration Box 10.1 Data Visualization Descriptive Statistics Box 10.2 Descriptive Statistics Graphs.

Plot type specific considerations

Descriptive Statistics ( )

Thursday, May 12, 2016 Report at 11:30 to Prairieview

Differential Methylation Analysis

Simon v RNA-Seq Analysis Simon v

SeqMonk tools for methylation analysis

SeqMonk tools for methylation analysis

Simon v1.0 Motif Searching Simon v1.0.

Chapter 3: Describing Relationships

Biases and their Effect on Biological Interpretation

Gene expression from RNA-Seq

Tutorial 6 : RNA - Sequencing Analysis and GO enrichment

QC analysis Uppsala University Work done by Jonas Almlöf

Unit 4 Statistical Analysis Data Representations

Understanding and Validating Experimental Expectations

S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.

Genome Wide Association Studies using SNP

The FASTQ format and quality control

Experimental Power Graphing Program

Inverse Transformation Scale Experimental Power Graphing

Analysing ChIP-Seq Data

Figure 1. Complete work-flow of the Scasat

Artefacts and Biases in Gene Set Analysis

CHAPTER 29: Multiple Regression*

AP Exam Review Chapters 1-10

CHAPTER 26: Inference for Regression

Topic 5: Exploring Quantitative data

Simon v ChIP-Seq Analysis Simon v

Chapter 2 Looking at Data— Relationships

ChIP-Seq Data Processing and QC

AP Statistics, Section 3.3, Part 1

Identification and Characterization of pre-miRNA Candidates in the C

Misleading Bioinformatics Mistakes, Biases, Mis-Interpretations and how to avoid them Festival of Genomics 2017.

Visualising and Exploring BS-Seq Data

Objectives (IPS Chapter 2.3)

Eric Samorodnitsky, Jharna Datta, Benjamin M

Looking at data: relationships - Caution about correlation and regression - The question of causation IPS chapters 2.4 and 2.5 © 2006 W. H. Freeman and.

CHAPTER 3 Describing Relationships

CHAPTER 3 Describing Relationships

Ying-Ying Yu, Ph. D. , Cui-Xiang Sun, Ph. D. , Yin-Kun Liu, Ph. D

CHAPTER 3 Describing Relationships

CHAPTER 3 Describing Relationships

Simon V Motif Searching Simon V

Inferential Statistics

Artefacts and Biases in Gene Set Analysis

Honors Statistics Review Chapters 4 - 5

CHAPTER 3 Describing Relationships

CHAPTER 3 Describing Relationships

Exploring and Presenting Results

Molecular Convergence of Neurodevelopmental Disorders

CHAPTER 3 Describing Relationships

Volume 26, Issue 5, Pages (March 2016)

CHAPTER 3 Describing Relationships

Association between 2 variables

Chapter 13 Multiple Regression

CHAPTER 3 Describing Relationships

Higher National Certificate in Engineering

CHAPTER 3 Describing Relationships

Model Adequacy Checking

CHAPTER 3 Describing Relationships

Presentation transcript:

Exploring and Understanding ChIP-Seq data Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2018-03-02

Data Creation and Processing Starting DNA Fragmented DNA ChIPped DNA Mapped BAM File FastQ Sequence File Sequence Library Filtered BAM File Exploration

Some Basic Questions Is there any enrichment? What is the size / patterning of enrichment? How well are my controls behaving? What is the best way to quantitate this data? Are there any technical artefacts?

Start with a visual inspection Is there any enrichment? What is the size / patterning of enrichment? How well are my controls behaving?

Start with a visual inspection Is there any enrichment? What is the size / patterning of enrichment? How well are my controls behaving?

Extending reads if necessary Peak Width For point enrichment, insert size is roughly peak width/2

Look for peaks Associate with features TSS Are my peaks narrow or broad Do peak positions obviously correspond to existing features?

Examine Controls IgG or other Mock IP Good result is no material at all Not worth sequencing. Reads are only informative if the ChIP hasn't worked. Input material (sonicated / Mnase etc) Genomic library - everywhere equally Technical issues can cause variation

Examine Controls Does the coverage look even If there are multiple inputs to do they look similar

Examine Controls

Why do controls misbehave? Low coverage Repetitive unmappable regions Holes in the assembly High coverage Mismapped reads from outside the assembly Biases DNA stability GC content Segmental Duplication

Types of input problem Categorical Quantitative Mismapped reads Indicates that the region can't be trusted Blacklist and remove - don't try to correct Quantitative Other biases (most often GC) Some potential to correct, but difficult Hopefully consistent, so will cancel out

Making Blacklists Unusual Coverage Pre-built lists Outlier detection (boxplots etc.) Often only filter over-representation (maybe also zero counts) Pre-built lists Large projects often build these (ENCODE) Not for all species

Pre-Existing Blacklists ENCODE modENCODE UCSC Check assembly versions https://sites.google.com/site/anshulkundaje/projects/blacklists

Comparison of samples

Initial Quantitation Always start with a simple unbiased quantitation (not focussed on features/peaks) Tiled measures over the whole genome Use approximate insert size as window size Something around 500bp is normally sensible Linear read count quantitation corrected for total library size

Quantitation Questions Initially similar to what we did with the raw data, but with some values behind it Can we observe enrichment in our ChIP above our controls Do we see similar enrichment between replicates, or between conditions

Compare samples Visual comparison against raw data Similar apparent overall enrichment Any obvious differences

Compare samples Scatterplot input vs ChIP Raw Filtered Any relationship between input and ChIP What proportion enriched How enriched over input

Compare samples Scatterplot input vs input Any suggestion of differential biases in inputs Can we merge them to use as a common input

Compare samples Scatterplot ChIP vs ChIP Look at examples for different Parts of the plot Look for outgroups (differentially enriched) Compare level of enrichment (compare to diagonal)

Compare samples Higher level clustering Correlation Matrix Correlation Tree PCA Plot tSNE Plot

Compare samples Summarise distributions Cumulative Distribution Plot QQ Plot Flatness of input Separation of ChIPs Consistency of ChIPs

Associate enrichment with features

Trend Plots Graphical way to look at overall enrichment relative to positions in features Gene bodies Promoters CpG islands May influence how we later quantitate and analyse the data Analyse per feature Look for exceptions to the general rule

Trend Plot Example Overall average Says nothing about proportion of features affected

Trend plots should match the data

Aligned Probes Plots give more detail Information per feature instance Comparison of equivalent features in different marks/samples

After exploration you should.. Know whether your ChIP is really enriched Know the nature / shape of the enrichment Know whether your controls behave well Know whether you're likely to have differential enrichment Know if you will need additional normalisation Know the best strategy to measure your data