1
Analyzing ChIP-seq data
Wing-Kin Sung National University of Singapore
2
Transcriptional Control (I)
3
Transcriptional Control (II)
4
Protein-DNA binding sites
Binding sites usually consist of 5-12 bases (up to 30 bp).
The binding-site sequence preferences of protein factors are not exact. They may be represented as a weight matrix.
[Figure: a transcription factor bound to its binding site within the sequence AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA]
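To make the weight-matrix idea concrete, here is a minimal sketch that scores every window of a sequence against a small position probability matrix and reports the best match. The matrix values, its 6-bp width, the uniform background and the function names are all invented for illustration; none of them come from the slides.

```python
import math

# Hypothetical 6-bp position probability matrix (one dict per position).
# The values are invented for illustration and favour the word CACGTG.
PPM = [
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform base frequencies


def pwm_score(site):
    """Log-odds score of one site (must have the same length as the matrix)."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PPM, site))


def best_site(sequence):
    """Return (score, start, site) for the best-scoring forward-strand window."""
    width = len(PPM)
    windows = (
        (pwm_score(sequence[i:i + width]), i, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    )
    return max(windows)
```

Because the preference is not exact, a site that differs from the favoured word at one position still gets a positive (but lower) score rather than being rejected outright.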
5
Question
Can we identify where the transcription factors bind on the genome?
Can we identify the binding motifs of the transcription factors?
6
Technology: ChIP experiment
Chromatin immunoprecipitation (ChIP) experiment: detects the interaction between a protein (transcription factor) and DNA.
7
Technology: ChIP-seq
[Figure: sonication + ChIP; ChIP-sequencing + mapping to the reference genome; peak detection, with noise labelled]
8
ChIP-seq data
Tag mapping
Peak calling (CCAT)
Motif scanning (CentDist)
9
CCAT: A peak finding method
10
ChIP-seq peak finders
ChIP-seq is becoming the mainstream approach for genome-wide study of protein-DNA interactions, histone modifications and DNA methylation patterns.
Many tools have been proposed for ChIP-seq analysis (e.g., PeakFinder, MACS, SISSRs, PeakSeq, CisGenome).
11
Aim
Contribution of CCAT:
How to estimate the noise in a ChIP-seq library?
How to perform a more accurate FDR estimation?
Aim: to show that CCAT can identify weak binding sites that cannot be discovered by existing methods.
12
ChIP-seq model (Linear signal-noise model)
[Figure: reads of our sample library, with the binding regions marked]
13
How to identify binding sites with the help of a control library?
Our sample library (N=27); control library (M=14).
The sample library has 3-fold more reads. Hence, we predict that this is a binding site.
14
What happens if we cannot correctly estimate the noise?
Our sample library (N=27); control library (M=28).
When the control library has almost the same size as the sample library, we fail to identify this binding site.
15
How to estimate noise? (I)
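If we know the list of background regions R, the noise rate α can be estimated as
$$\alpha = \frac{\sum_{r \in R} s_r}{\sum_{r \in R} c_r} \times \frac{M}{N}$$
where s_r and c_r are the numbers of sample and control reads in region r, and N and M are the sizes of the sample and control libraries.
Our sample library (N=27); control library (M=28).
In this example, we estimate α = 7/14 × 28/27.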
16
How to estimate noise? (II)
Given some initial guess of the noise rate α, we can predict the list of background regions R as the regions with #sample_reads < α × (N/M) × #ctrl_reads.
Our sample library (N=27); control library (M=28).
In this example, if α = 1, the predicted background regions are the regions with #sample_reads < 27/28 × #ctrl_reads.
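The two steps described so far (estimating the noise rate from known background regions, and predicting background regions from a guessed noise rate) can be sketched as follows. The per-region read counts are made up, chosen only so that the totals match the slide's example (N=27, M=28, with 7 sample reads and 14 control reads in the background). The name alpha for the noise rate and both function names are assumptions, and this is a literal reading of the slides rather than the CCAT implementation.

```python
def estimate_noise_rate(sample, control, background, n_sample, n_control):
    """Noise rate from background regions: (sample reads / control reads in R) * (M / N)."""
    s_bg = sum(sample[i] for i in background)
    c_bg = sum(control[i] for i in background)
    return (s_bg / c_bg) * (n_control / n_sample)


def predict_background(sample, control, alpha, n_sample, n_control):
    """Indices of regions with #sample_reads < alpha * (N / M) * #ctrl_reads."""
    cutoff = alpha * n_sample / n_control
    return [i for i, (s, c) in enumerate(zip(sample, control)) if s < cutoff * c]


# Made-up read counts for four regions (not taken from the slides' figure).
sample = [10, 10, 4, 3]   # N = 27
control = [7, 7, 7, 7]    # M = 28
N, M = sum(sample), sum(control)

background = predict_background(sample, control, 1.0, N, M)   # start from alpha = 1
alpha = estimate_noise_rate(sample, control, background, N, M)
```

With these counts the last two regions are predicted as background, and the estimate is 7/14 × 28/27, roughly 0.52. Alternating the two functions is the loop outlined in "How to estimate noise? (III)".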
17
How to estimate noise? (III)
Input: ChIP library and control library
Set the noise rate α = 1;
Iterate until α has stabilized:
Estimate the background regions R;
Predict α from the regions R;
18
Spike-in Simulation
Two spike-in datasets: Nanog and H3K4me3.
Spike-in datasets generated from:
library ID    antibody    # of uniquely mapped reads    reference
control 1     GFP         3.83M                         Chen et al., 2008, Cell
control 2     WCE         6.76M                         Unpublished
Nanog 1       Nanog       6.03M                         Marson et al., 2008, Cell
Nanog 2                   8.42M
H3K4me3 1     H3K4me3     6.94M
H3K4me3 2                 8.85M                         Mikkelsen et al., 2007, Nature
Strategy for generating the Nanog spike-in dataset:
Determine the spike-in regions from Nanog1, and retrieve the spike-in reads from Nanog2;
Background noise in the ChIP library comes from control1, and noise in the control library comes from control2.
19
Spike-in Simulation
Convergence is fast: the noise rate converges in about 5 iterations.
The noise rate estimation is accurate: relative error < 5%.
20
FDR estimation
Given a list of candidate sites ranked by some scoring function, our aim is to determine the cutoff threshold such that FDR < 0.05.
If the threshold is too loose, we get more noise.
If the threshold is too stringent, we miss the weak peaks.
To identify the weak peaks, we need an accurate FDR estimation.
21
Methods for estimating FDR
A number of methods exist for determining the cutoff:
Binomial p-value, e.g., Benjamini-Hochberg (B-H) correction (Benjamini & Hochberg, 1995; Rozowsky et al., 2009) and Storey's method (Storey, 2002; Nix et al., 2008)
Empirical p-value, e.g., eFDR (Nix et al., 2008)
Library swapping, proposed by Zhang et al. (2008)
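As an illustration of the first family, here is a minimal sketch of a binomial p-value per candidate site followed by the Benjamini-Hochberg step-up procedure. The null model used here (each read in a site comes from the sample library with probability N / (N + M)) is an assumption made for the sketch, as are the function names; the cited tools may define their tests differently.

```python
from math import comb


def binomial_pvalue(s, c, n_sample, n_control):
    """P(X >= s) for X ~ Binomial(s + c, N / (N + M)); an assumed null model."""
    n = s + c
    p = n_sample / (n_sample + n_control)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))


def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of the hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value passes its threshold.
        if pvalues[i] <= rank / m * q:
            n_reject = rank
    return sorted(order[:n_reject])
```

A typical use would be to compute one p-value per candidate site and keep the sites whose indices are returned by `benjamini_hochberg`.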
22
Is the binomial p-value good?
The observed background variation differs from that estimated by the binomial model.
Reason: the wet-lab noise is not uniformly distributed in the genome.
The binomial p-value is not good enough!
23
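Library swapping
[Figure: N reads from the ChIP library and N reads from the control library. With the ChIP reads as sample and the control reads as control, peak calling gives the ChIP sites; with the two libraries swapped, it gives the control sites. The two lists of sites determine the empirical cutoff.]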
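In library swapping, ChIP sites are called with the ChIP library as sample, control sites are called with the two libraries swapped, and the empirical FDR at a score cutoff is the number of control sites passing the cutoff divided by the number of ChIP sites passing it. The sketch below picks the loosest cutoff whose empirical FDR stays below a target. It assumes each site already carries a score from some peak caller; the function name and the input format are hypothetical.

```python
def swap_cutoff(chip_scores, control_scores, max_fdr=0.05):
    """Loosest score cutoff with (#control sites / #ChIP sites) below max_fdr.

    Returns None when no cutoff satisfies the target.
    """
    best = None
    # Try every observed ChIP score as a cutoff, from strictest to loosest.
    for cutoff in sorted(set(chip_scores), reverse=True):
        n_chip = sum(score >= cutoff for score in chip_scores)
        n_ctrl = sum(score >= cutoff for score in control_scores)
        if n_ctrl / n_chip < max_fdr:
            best = cutoff
    return best
```

The empirical FDR is not guaranteed to change monotonically with the cutoff, which is why the loop checks every candidate instead of stopping at the first failure.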
24
More on library swapping
Library swapping works well for most cases. However, as mentioned by Zhang et al., the estimated FDR would be biased for some cases. We found that the bias is due to the fact that they did not consider the noise rate.
25
Modified library swapping
[Figure: the library-swapping diagram: N reads from the ChIP library and N reads from the control library; ChIP sites and control sites; determine the empirical cutoff.]
26
Spike-in Simulation
[Figure panels: FDR estimation for Nanog; FDR estimation for H3K4me3]
Library swapping has the best FDR estimation!
27
Application to mESC H3K4me3 data
ChIP library: Mikkelsen et al., 2007, Nature.
Control library: Chen et al., 2008, Cell.
Normalized difference score (Nix et al., 2008, BMC Bioinfo.)
Distinct chromatin features are associated with strong and weak H3K4me3 sites.
FDR: CCAT 0.02; PeakSeq 0.05
qPCR validation
28
Application to mESC H3K36me3 data
Comparison of 8176 novel regions to RefSeq, Ensembl, and MGC gene annotation.
29
Motif scanning for ChIP-seq data
30
Advantages of ChIP-seq
ChIP-seq allows us to precisely map global binding sites for any TF with a validated antibody. It offers two advantages:
More candidate binding sites (known as peaks)
Higher resolution (usually the main motif is located within +/- 100bp of the peaks)
31
How to find motifs in ChIP-seq data?
Input: a set of peaks
Select high-intensity peaks.
For every selected peak, extract the DNA sequence in, say, the +/-200bp region around the peak.
Perform motif finding on the selected DNA sequences.
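The first two steps of this procedure (selecting strong peaks and extracting the flanking sequence) can be sketched as below. The in-memory genome dictionary, the (chrom, summit, intensity) peak tuples, the function name and the default of 500 top peaks are all assumptions made for illustration; the slides do not specify them. The motif-finding step itself is left to an external tool.

```python
def peak_sequences(genome, peaks, top_n=500, flank=200):
    """Extract +/- flank bp around the summits of the top_n strongest peaks.

    genome: dict mapping chromosome name -> sequence string.
    peaks: iterable of (chrom, summit, intensity) with a 0-based summit.
    """
    strongest = sorted(peaks, key=lambda p: p[2], reverse=True)[:top_n]
    sequences = []
    for chrom, summit, _ in strongest:
        chrom_seq = genome[chrom]
        # Clip the window at the chromosome ends.
        start = max(0, summit - flank)
        end = min(len(chrom_seq), summit + flank + 1)
        sequences.append(chrom_seq[start:end])
    return sequences
```

The returned sequences are ordered from the strongest peak down, so a caller can pass them straight to a motif finder or write them out as FASTA.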
32
Applying this approach to the AR dataset in the LNCaP cell line
LNCaP cell line, DHT-treated for 2hr, ChIP-ed with AR antibody.
MACS reports the binding sites.
Using 600 vertebrate PWMs (145 clusters) from TRANSFAC.
Perform CEAS and Core-TF using the top sites.
Window size: 200, 400, 1000.
For Core-TF, we try a random background and a promoter background.
33
Motif Scanning Result (top 20 results)
There are 7 known co-TFs of AR. Out of these 7 known co-factors, 5 are discovered by CoreTF and CEAS.
[Figure: top-20 results of CoreTF (200bp) and CEAS (400bp); labelled motifs: GATA, CEBP, NKX, OCT, ETS, AR, FOX, NF1]
34
Detail of motif scanning result
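Settings (columns, in order): CORE_TF prombg 200, CORE_TF prombg 400, CORE_TF prombg 1000, CORE_TF randbg 200, CORE_TF randbg 400, CORE_TF randbg 1000, CEAS 200, CEAS 400, CEAS 1000
Rank of each motif (non-empty cells only, listed left to right):
AR: 2, 6, 1
CEBP: 12, 16, 25, 20, 15, 7
ETS: 64, 61, 66, 37, 47
FOX:
GATA: 10, 13, 14
NF1: 40, 60, 70, 21, 31, 3
NKX: 11, 5, 4
OCT: 8, 19, 26
AP4: 65
AUC (one value per setting): 0.91, 0.8917, 0.8742, 0.9358, 0.9375, 0.9208, 0.6854, 0.6875, 0.625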
35
ChIP-seq protocol revisit
From the empirical study of Qi et al. (2006), we know that the length of a ChIP fragment follows a gamma distribution.
[Figure: sonication, followed by immunoprecipitation]
36
ChIPed motifs show center enrichment around AR peaks
Due to the ChIP-seq protocol, we expect the correct motif to show center enrichment in its frequency graph.
We assume that noise such as CG bias is uniformly distributed. If the motif is not real, the 1st derivative of its frequency graph (its velocity) will be near zero.
The frequency graph below shows that AR has center enrichment, while the velocity graph shows that AR is not noise.
[Figures: AR motif distribution around AR peaks; velocity distribution for the AR motif]
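A minimal sketch of the two graphs, assuming the motif hits have already been located and expressed as integer offsets (in bp) from the peak center. The bin size, the maximum distance and the function names are assumptions made for illustration; the slides do not give the exact construction.

```python
def frequency_graph(offsets, max_dist=200, bin_size=50):
    """Count motif hits by absolute distance from the peak center.

    offsets: integer positions of motif hits relative to the peak center.
    """
    counts = [0] * (max_dist // bin_size)
    for offset in offsets:
        d = abs(offset)
        if d < max_dist:
            counts[d // bin_size] += 1
    return counts


def velocity_graph(counts):
    """First difference of the frequency graph (a discrete 1st derivative)."""
    return [b - a for a, b in zip(counts, counts[1:])]
```

For a center-enriched motif the counts fall as the distance grows, so the velocity is clearly negative near the center; for uniformly spread hits the counts are flat and the velocity stays near zero.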
37
Co-motifs show center enrichment around peaks
Since co-regulating factors are expected to co-occur in close proximity, we expect co-motifs to also show center enrichment around peaks.
For example, NF1 is a known co-motif of AR. We observe center enrichment of the NF1 motif around the AR peaks.
38
Center distribution score
We define a score function based on the frequency graph and the velocity graph.
Features:
We don't require a background model.
We learn the window size automatically.
We learn the PWM score cutoff.
39
Automatically learn the parameter of the frequency graph
V$AR_02
40
Non-co-motifs do not show center enrichment around peaks
The two figures below verify this.
41
CENTDIST workflow
42
CENTDIST
Based on the center enrichment of the TFs relative to the peak, we derive a method, CENTDIST.
CENTDIST measures the center enrichment using a Z-score.
The TFs are then ranked and reported.
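The slides do not define the Z-score itself, so the following is only one plausible way to turn center enrichment into a Z-score, not the CENTDIST formula. It assumes that, if a motif were unrelated to the peaks, its hits would be spread uniformly over the window, and it compares the observed number of hits near the center with that expectation. The window sizes and the function name are hypothetical.

```python
import math


def center_enrichment_z(offsets, window=200, center=50):
    """Z-score of hits within +/- center bp versus a uniform spread over +/- window bp.

    offsets: integer positions of motif hits relative to the peak center.
    """
    inside = [abs(o) for o in offsets if abs(o) <= window]
    n = len(inside)
    if n == 0:
        return 0.0
    k = sum(d <= center for d in inside)
    # Expected fraction of hits near the center if hits were uniform (assumption).
    p = (2 * center + 1) / (2 * window + 1)
    return (k - n * p) / math.sqrt(n * p * (1 - p))
```

Motifs could then be ranked by this value, with large positive scores indicating hits concentrated near the peak centers.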
43
Can CENTDIST find known co-motifs of AR?
All known co-motifs of AR show good center enrichment.
Note that although the Oct1 motif does not show good enrichment around the peaks, it shows good enrichment in its 1st- and 2nd-order derivatives.
44
CENTDIST vs CEAS vs CORE_TF
Settings (columns, in order): CENTDIST, CORE_TF prombg 200, CORE_TF prombg 400, CORE_TF prombg 1000, CORE_TF randbg 200, CORE_TF randbg 400, CORE_TF randbg 1000, CEAS 200, CEAS 400, CEAS 1000
Rank of each motif (non-empty cells only, listed left to right):
AR: 1, 2, 6
CEBP: 14, 12, 16, 25, 20, 15, 7
ETS: 9, 64, 61, 66, 37, 47
FOX:
GATA: 10, 13
NF1: 11, 40, 60, 70, 21, 31, 3
NKX: 8, 5, 4
OCT: 19, 26
AP4: 65
AUC (one value per setting): 0.9683, 0.91, 0.8917, 0.8742, 0.9358, 0.9375, 0.9208, 0.6854, 0.6875, 0.625
45
Can CENTDIST identify novel factor?
AP4 is ranked 21 by CENTDIST.
Core-TF and CEAS rank AP4 low, since AP4 is not highly enriched around the peaks.
46
Validation of AP4
47
Validation of AP4
To be unbiased, we performed an AP4 ChIP-seq experiment.
38% of AP4 peaks overlap with AR peaks.
[Venn diagram: 62768 AR-only peaks, 2296 shared peaks, 3786 AP4-only peaks]
48
Validation of AP4
We also check the microarray expression.
The result suggests that AP4 may co-localize with AR to directly up-regulate the transcription of androgen target genes.
49
Validation using ChIP-seq from ES cell
CENTDIST performs better than CEAS and Core-TF for most cases.
          CENTDIST    CEAS      Core-TF
Nanog     0.9647      0.7346    0.7549
Oct4      0.9133      0.825     0.7508
Sox2      0.9499      0.8765    0.6939
Stat3     0.9309      0.7492    0.7308
Smad1     0.8483      0.8803    0.7048
P300      0.9234      0.8098    0.719
KLF4      0.8432      0.6864    0.8015
ESRRB     0.8622      0.9744    0.9295
Cmyc      0.9776      0.8401    0.9237
Nmyc      0.9334      0.5235    0.9107
ZFX       0.9545      0.5373    0.9221
E2F1      0.9529      0.5349    0.9351
AVG AUC
50
p300
CENTDIST has the potential to find enhancer factors using p300 ChIP-seq.
Known cofactors    CENTDIST    CORE_TF    CEAS
Sox                5           3          8
Oct                1           2          6
Nanog              33          49         107
Stat               40          63         137
ERE
CP2                4           91         12
E2F                18          116        7
AVG RANK: 16, 47
51
Discussion
CentDist can find motifs which are marginally over-represented.
CentDist can detect the window size.
CentDist doesn't require a background model.
52
Acknowledgement Bioinformatics Sequencing Cancer Biology Guoliang Li
Pramila Charlie Lee Han Xu Fabi Kuan Hon Loh Chang Cheng Wei Gao Song Chandana Rikky Zhang Zhi Zhou Sequencing Wei Chialin Handoko Lusy Sequencing team Cancer Biology Edwin Cheung Pau You Fu